Abstract
For this assignment, we work with real data containing observations of historical credit applicants rated as Good or Bad credit (encoded as 1 and 0 respectively in the response variable). The goal is to obtain a model that can be used to determine whether new applicants present a high credit risk. To that end, we will apply the CRISP-DM model and communicate the results in this report. Our main goal is to evaluate different models and choose the best among them to determine whether new applicants represent a good or a bad credit risk.
Under this context, we decided to use the methodology named “Cross-Industry Standard Process for Data Mining” (CRISP-DM). This model consists of six phases that naturally describe the data science life cycle. Below, you will find a figure that describes this process.
Figure 1: CRISP-DM Process
For this project, we will use the data GermanCredit.csv provided in the course Projects in Data Analytics for Decision Making given by Professor Jacques Zuber, which contains 1’000 observations about credit applicants, described by 30 variables.
data <- read.csv2(here::here("data/GermanCredit.csv"), dec = ".", header = TRUE)
In order to make a better analysis, we ask ourselves some questions that we will try to answer through the EDA and the applied models. We will come back to these questions in the conclusions section:
The goal of this phase is to understand the project objectives and what is needed to achieve them, then translate them into a data mining problem definition, which includes a designed plan for the analysis and the application we will follow step by step.
In general, the banking business has two main goals:
Both objectives generate an organic balance between the customers’ needs and the gains that the company requires for its operations. However, in this analysis, we will focus on diminishing the risk linked to the second goal. In other words, we are looking for a good model to help us forecast which clients have a higher risk of not being able to pay back a loan granted to them.
Our main goal will be to minimize the losses given by the sum of the credit amounts granted to applicants who are predicted to be positive (hence eligible for a credit) but should actually have been forecasted as negative, as they represent the risk for the company of not recovering the amount lent.
We want the losses to be smaller than 10% of the total amount of credit that would be granted to the customers.
Goal: Losses < 10% of the granted credit amount
We will assess this by assuming that the company grants a credit only to applicants predicted to have a good credit score, and rejects it otherwise.
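As an illustration, below is a minimal R sketch of how this goal could be checked, assuming hypothetical vectors pred (predicted class), truth (actual class) and amount (requested credit amounts), none of which are defined at this point of the report:

#hypothetical goal check: losses from wrongly granted credits vs. total granted amount
false_positive <- pred == 1 & truth == 0  #credit granted to an applicant who is actually a bad risk
losses  <- sum(amount[false_positive])    #amounts at risk of not being paid back
granted <- sum(amount[pred == 1])         #total amount that would be granted
losses < 0.10 * granted                   #TRUE if the business goal is met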
Figure 2: Goal of a credit
Now we will take into consideration some assumptions and list the requirements and constraints that the project could have.
| Aspect | Items |
|---|---|
| Assumptions | 1) The team members have all the required skills. 2) The data is real. |
| Requirements | 1) Boundaries of the work: identify the best model among at least 5. 2) Submit a report with our findings. |
| Constraints | 1) We have about 7 weeks to complete the analysis. 2) The size of the data set. 3) Limited input variables |
The proposed data mining goal for this analysis is to obtain a model that evaluates well the risk of failure to repay, given the information of a new applicant, hence helping to decide whether it is a good idea to grant them the credit they are requesting. This algorithm needs to have a high accuracy, with particular attention to the negative impact of a false positive.
The specific goals for the different parts of our analysis are the following:
In order to meet the objectives we have made a Gantt chart; the time span considered was divided into 6 weeks, from November 2 to December 18.
Figure 3: Gantt Project
We have followed the deadlines in order to obtain the corresponding feedback for each week.
Moving forward with the analysis, the goal of this second step is to get a first perception of the information brought by the data and create hypotheses about them.
The dataset was delivered together with the description of the task and is in csv format. It contains 1’000 observations, one per row, with 30 input variables and 1 output variable. In addition to the dataset, we looked for more information in YouTube videos, mainly to familiarize ourselves with the credit selection process itself and, hence, dig deeper into the variables that we consider bring the most information.
At this point, we will examine the gross properties of the acquired data. Let’s start by checking its structure and size. As you can see below, there are 1’000 rows and 32 columns. The first column identifies each observation with a unique ID (a number), the 30 following columns are the input variables, while the last one is the output variable, which indicates whether the person is a high risk (the credit is rejected) or not (the credit is accepted).
dim(data)
## [1] 1000 32
Now, let’s take a look at the summary and the structure of the data, including their statistical characteristics, such as minimum, mean and maximum.
Overview of the dataset (1000 observations):
As we can see from the graph, the majority of the observations have a positive value (700 against 300).
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max
## OBS. 1 300 515.76 281.03 542.0 518.66 349.15 2 999
## CHK_ACCT 2 300 0.90 1.05 1.0 0.75 1.48 0 3
## DURATION 3 300 24.86 13.28 24.0 23.59 17.79 6 72
## HISTORY 4 300 2.17 1.08 2.0 2.19 0.00 0 4
## NEW_CAR 5 300 0.30 0.46 0.0 0.25 0.00 0 1
## USED_CAR 6 300 0.06 0.23 0.0 0.00 0.00 0 1
## FURNITURE 7 300 0.19 0.40 0.0 0.12 0.00 0 1
## RADIO.TV 8 300 0.21 0.41 0.0 0.13 0.00 0 1
## EDUCATION 9 300 0.07 0.26 0.0 0.00 0.00 0 1
## RETRAINING 10 300 0.11 0.32 0.0 0.02 0.00 0 1
## AMOUNT 11 300 3938.13 3535.82 2574.5 3291.18 2092.69 433 18424
## SAV_ACCT 12 300 0.67 1.30 0.0 0.34 0.00 0 4
## EMPLOYMENT 13 300 2.17 1.22 2.0 2.18 1.48 0 4
## INSTALL_RATE 14 300 3.10 1.09 4.0 3.25 0.00 1 4
## MALE_DIV 15 300 0.07 0.25 0.0 0.00 0.00 0 1
## MALE_SINGLE 16 300 0.49 0.50 0.0 0.48 0.00 0 1
## MALE_MAR_or_WID 17 300 0.08 0.28 0.0 0.00 0.00 0 1
## CO.APPLICANT 18 300 0.06 0.24 0.0 0.00 0.00 0 1
## GUARANTOR 19 300 0.03 0.18 0.0 0.00 0.00 0 1
## PRESENT_RESIDENT 20 300 2.85 1.09 3.0 2.94 1.48 1 4
## REAL_ESTATE 21 300 0.20 0.40 0.0 0.12 0.00 0 1
## PROP_UNKN_NONE 22 300 0.22 0.42 0.0 0.15 0.00 0 1
## AGE 23 300 33.96 11.22 31.0 32.38 8.90 19 74
## OTHER_INSTALL 24 300 0.25 0.44 0.0 0.19 0.00 0 1
## RENT 25 300 0.23 0.42 0.0 0.17 0.00 0 1
## OWN_RES 26 300 0.62 0.49 1.0 0.65 0.00 0 1
## NUM_CREDITS 27 300 1.37 0.56 1.0 1.29 0.00 1 4
## JOB 28 300 1.94 0.67 2.0 1.95 0.00 0 3
## NUM_DEPENDENTS 29 300 1.15 0.36 1.0 1.07 0.00 1 2
## TELEPHONE 30 300 0.38 0.49 0.0 0.35 0.00 0 1
## FOREIGN 31 300 0.01 0.11 0.0 0.00 0.00 0 1
## RESPONSE 32 300 0.00 0.00 0.0 0.00 0.00 0 0
## range skew kurtosis se
## OBS. 997 -0.08 -1.13 16.23
## CHK_ACCT 3 0.99 -0.27 0.06
## DURATION 66 0.83 0.03 0.77
## HISTORY 4 0.07 -0.09 0.06
## NEW_CAR 1 0.89 -1.22 0.03
## USED_CAR 1 3.82 12.60 0.01
## FURNITURE 1 1.55 0.39 0.02
## RADIO.TV 1 1.44 0.08 0.02
## EDUCATION 1 3.26 8.64 0.02
## RETRAINING 1 2.43 3.91 0.02
## AMOUNT 17991 1.57 2.05 204.14
## SAV_ACCT 4 1.83 1.82 0.08
## EMPLOYMENT 4 0.12 -0.96 0.07
## INSTALL_RATE 3 -0.72 -0.97 0.06
## MALE_DIV 1 3.46 9.98 0.01
## MALE_SINGLE 1 0.05 -2.00 0.03
## MALE_MAR_or_WID 1 3.00 7.02 0.02
## CO.APPLICANT 1 3.69 11.63 0.01
## GUARANTOR 1 5.17 24.85 0.01
## PRESENT_RESIDENT 3 -0.25 -1.40 0.06
## REAL_ESTATE 1 1.49 0.23 0.02
## PROP_UNKN_NONE 1 1.32 -0.25 0.02
## AGE 55 1.14 0.73 0.65
## OTHER_INSTALL 1 1.13 -0.73 0.03
## RENT 1 1.25 -0.43 0.02
## OWN_RES 1 -0.49 -1.76 0.03
## NUM_CREDITS 3 1.45 2.34 0.03
## JOB 3 -0.40 0.44 0.04
## NUM_DEPENDENTS 1 1.91 1.67 0.02
## TELEPHONE 1 0.51 -1.75 0.03
## FOREIGN 1 8.44 69.53 0.01
## RESPONSE 0 NaN NaN 0.00
## ------------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max
## OBS. 1 700 493.96 292.05 482.5 492.61 377.32 1 1000
## CHK_ACCT 2 700 1.87 1.23 2.0 1.96 1.48 0 3
## DURATION 3 700 19.21 11.08 18.0 17.88 8.90 4 60
## HISTORY 4 700 2.71 1.04 2.0 2.73 0.00 0 4
## NEW_CAR 5 700 0.21 0.41 0.0 0.13 0.00 0 1
## USED_CAR 6 700 0.12 0.33 0.0 0.03 0.00 0 1
## FURNITURE 7 700 0.18 0.38 0.0 0.09 0.00 0 1
## RADIO.TV 8 700 0.31 0.46 0.0 0.26 0.00 0 1
## EDUCATION 9 700 0.04 0.20 0.0 0.00 0.00 -1 1
## RETRAINING 10 700 0.09 0.29 0.0 0.00 0.00 0 1
## AMOUNT 11 700 2985.46 2401.47 2244.0 2564.20 1485.57 250 15857
## SAV_ACCT 12 700 1.29 1.65 0.0 1.11 0.00 0 4
## EMPLOYMENT 13 700 2.48 1.19 2.0 2.54 1.48 0 4
## INSTALL_RATE 14 700 2.92 1.13 3.0 3.02 1.48 1 4
## MALE_DIV 15 700 0.04 0.20 0.0 0.00 0.00 0 1
## MALE_SINGLE 16 700 0.57 0.49 1.0 0.59 0.00 0 1
## MALE_MAR_or_WID 17 700 0.10 0.29 0.0 0.00 0.00 0 1
## CO.APPLICANT 18 700 0.03 0.18 0.0 0.00 0.00 0 1
## GUARANTOR 19 700 0.06 0.25 0.0 0.00 0.00 0 2
## PRESENT_RESIDENT 20 700 2.84 1.11 3.0 2.93 1.48 1 4
## REAL_ESTATE 21 700 0.32 0.47 0.0 0.27 0.00 0 1
## PROP_UNKN_NONE 22 700 0.12 0.33 0.0 0.03 0.00 0 1
## AGE 23 700 36.30 11.77 34.0 34.92 10.38 19 125
## OTHER_INSTALL 24 700 0.16 0.36 0.0 0.07 0.00 0 1
## RENT 25 700 0.16 0.36 0.0 0.07 0.00 0 1
## OWN_RES 26 700 0.75 0.43 1.0 0.82 0.00 0 1
## NUM_CREDITS 27 700 1.42 0.58 1.0 1.35 0.00 1 4
## JOB 28 700 1.89 0.65 2.0 1.89 0.00 0 3
## NUM_DEPENDENTS 29 700 1.16 0.36 1.0 1.07 0.00 1 2
## TELEPHONE 30 700 0.42 0.49 0.0 0.39 0.00 0 1
## FOREIGN 31 700 0.05 0.21 0.0 0.00 0.00 0 1
## RESPONSE 32 700 1.00 0.00 1.0 1.00 0.00 1 1
## range skew kurtosis se
## OBS. 999 0.04 -1.23 11.04
## CHK_ACCT 3 -0.39 -1.53 0.05
## DURATION 56 1.18 1.38 0.42
## HISTORY 4 0.00 -0.90 0.04
## NEW_CAR 1 1.44 0.08 0.02
## USED_CAR 1 2.29 3.26 0.01
## FURNITURE 1 1.70 0.89 0.01
## RADIO.TV 1 0.81 -1.34 0.02
## EDUCATION 2 4.31 20.27 0.01
## RETRAINING 1 2.86 6.18 0.01
## AMOUNT 15607 1.94 4.62 90.77
## SAV_ACCT 4 0.76 -1.16 0.06
## EMPLOYMENT 4 -0.22 -0.87 0.04
## INSTALL_RATE 3 -0.45 -1.29 0.04
## MALE_DIV 1 4.50 18.32 0.01
## MALE_SINGLE 1 -0.30 -1.91 0.02
## MALE_MAR_or_WID 1 2.74 5.53 0.01
## CO.APPLICANT 1 5.23 25.39 0.01
## GUARANTOR 2 3.93 14.88 0.01
## PRESENT_RESIDENT 3 -0.28 -1.38 0.04
## REAL_ESTATE 1 0.78 -1.39 0.02
## PROP_UNKN_NONE 1 2.27 3.17 0.01
## AGE 106 1.43 4.51 0.45
## OTHER_INSTALL 1 1.88 1.54 0.01
## RENT 1 1.89 1.59 0.01
## OWN_RES 1 -1.17 -0.63 0.02
## NUM_CREDITS 3 1.20 1.30 0.02
## JOB 3 -0.37 0.50 0.02
## NUM_DEPENDENTS 1 1.89 1.59 0.01
## TELEPHONE 1 0.34 -1.89 0.02
## FOREIGN 1 4.26 16.21 0.01
## RESPONSE 0 NaN NaN 0.00
As we can see in the table shown above, all the values are integers and there are no missing values. In addition, we can see some inconsistencies between the variables and the initial description, which we explain in the following table.
| Variable | Description | Inconsistencies |
|---|---|---|
| CHK_ACCT | C: 0, 1, 2, 3 | X |
| DURATION | Numerical | X |
| HISTORY | C: 0, 1, 2, 3, 4 | X |
| NEW_CAR | B: 0, 1 | X |
| USED_CAR | B: 0, 1 | X |
| FURNITURE | B: 0, 1 | X |
| RADIO.TV | B: 0, 1 | X |
| EDUCATION | B: 0, 1 | ✔: According to the description we should have a binary variable, but the data show a -1 |
| RETRAINING | B: 0, 1 | X |
| AMOUNT | Numerical | X |
| SAV_ACCT | C: 0, 1, 2, 3, 4 | X |
| EMPLOYMENT | C: 0, 1, 2, 3, 4 | X |
| INSTALL_RATE | Numerical | X |
| MALE_DIV | B: 0, 1 | X |
| MALE_SINGLE | B: 0, 1 | X |
| MALE_MAR_WID | B: 0, 1 | X |
| CO-APPLICANT | B: 0, 1 | X |
| GUARANTOR | B: 0, 1 | ✔: According to the description we should have a binary variable, but the data show a 2 |
| PRESENT_RESIDENT | C: 0, 1, 2, 3 | ✔: According to the description the categories should range from 0 to 3, but the data show values from 1 to 4 |
| REAL_ESTATE | B: 0, 1 | X |
| PROP_UNKN_NONE | B: 0, 1 | X |
| AGE | Numerical | ✔: Outliers identified; the age should not go up to 125 years |
| OTHER_INSTALL | B: 0, 1 | X |
| RENT | B: 0, 1 | X |
| OWN_RES | B: 0, 1 | X |
| NUM_CREDITS | Numerical | X |
| JOB | C: 0, 1, 2, 3 | X |
| NUM_DEPENDENTS | Numerical | X |
| TELEPHONE | B: 0, 1 | X |
| FOREIGN | B: 0, 1 | X |
Then, for the 4 inconsistencies we have found, we have established the following hypotheses and solutions.
The corrections will be made in the next section.
It is also important to mention that the output variable shows that in 70% of the cases the credit is accepted and in 30% rejected, which could later bias the predictions.
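This proportion can be checked directly from the data, for instance:

#class proportions of the output variable (0 = rejected, 1 = accepted)
prop.table(table(data$RESPONSE))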
In this section, we will go deeper into the data and look for patterns or relationships between variables. To do so, we will draw a histogram of each variable to check the distribution of our data.
Regarding the last charts, we have the following observations:
Now that we have checked the distribution of the variables, let’s move on to the evaluation of their quartiles.
In the following table, you will find our principal observations.
| Boxplot | Observation |
|---|---|
| For each variable | We identified that some variables could be mutually exclusive. We can evaluate the formation of the following groups: 1) Aggregation of the credit-purpose variables. 2) Aggregation of the male variables into a categorical one. 3) Aggregation of REAL_ESTATE and PROP_UNKN_NONE into a categorical one. 4) Aggregation of RENT and OWN_RES into a categorical one. |
| plot by response | The variables which stand out the most are CHK_ACCT and EMPLOYMENT. We can observe that each box by group differs from the others. |
Additionally, we could identify some outliers, shown as red dots, as you can see in the second chart. The input variables grouped by the output variable show that some features bring more information than others.
Regarding the variables that we can create, we want to give some key definitions:
Now, we will take a look at the correlation between variables.
The variables which are the most correlated are the following:
In the model section we will evaluate the coefficients of the variables and continue this analysis in greater depth.
To be able to do so, we establish 3 questions that we will address during the resolution of this step:
## [1] 0
## [1] 1
## # A tibble: 1 x 4
## type cnt pcnt col_name
## <chr> <int> <dbl> <named list>
## 1 integer 31 100 <chr [31]>
Below you will find the answers to the 3 questions:
| Question | Answer |
|---|---|
| 1 | Yes, all the columns and rows contain information. |
| 2 | No, they contain some errors, found during the data description and exploration: 1) The variable EDUCATION shows a value that is not binary. 2) The variable PRESENT_RESIDENT has more categories than mentioned in the description. 3) The variable AGE is out of range. 4) AMOUNT, DURATION and AGE are the variables with the highest number of outliers. |
| 3 | No, there are no missing values in the data. |
In addition, we are going to analyse whether the aggregations mentioned in the boxplot observations are possible. To do so, we will apply the Chi-Squared Test to measure the independence between the variables. Next, we explain the steps we will follow for each analysis:
1) Variables: REAL_ESTATE and PROP_UNKN_NONE
First, we establish the hypotheses:
\(H_0\): The REAL_ESTATE and PROP_UNKN_NONE are independent variables.
against the bilateral alternative:
\(H_1\): They are not independent.
For the chi-squared test to be valid, the following conditions must be true:
Assumptions: significance level of 0.05. Clarification: the p-value is the probability that a chi-squared statistic with the given degrees of freedom is more extreme than the observed \(X^2\).
Finally, we are going to accept or reject the hypotheses by checking the p-value. If the p-value is less than the significance level we have chosen, we reject the null hypothesis and conclude that there is a relationship between the variables.
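For reference, the test statistic compares the observed counts \(O_{ij}\) of each combination of levels with the counts \(E_{ij}\) expected under independence:

\[X^2 = \sum_{i}\sum_{j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]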
2) RENT and OWN_RES
First, we establish the hypotheses:
\(H_0\): The RENT and OWN_RES are independent variables.
against the bilateral alternative:
\(H_1\): They are not independent.
Then, we move on with the same process that we mentioned above.
The analysis can be found in the next chapter.
| Task | Output |
|---|---|
| Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling. | Describe what decisions and actions were taken to address the data quality problems reported during the verify data quality task of the data understanding phase. Transformations of the data for cleaning purposes and the possible impact on the analysis results should be considered. |
We will consider how to correct the inconsistencies found in the previous chapter, which concern four different variables, namely:
We have already decided how to correct them, hence we will now proceed accordingly.
We will start by correcting the noise in EDUCATION, GUARANTOR and PRESENT_RESIDENT: for the first two, we simply replace the values -1 and 2 with the value 1; for the latter, we renumber the categories by diminishing each value by 1. This gives us the true values corresponding to the data description that was given to us.
#EDUCATION
data %<>%
mutate(EDUCATION = replace(EDUCATION, EDUCATION == -1, 1))
#GUARANTOR
data %<>%
mutate(GUARANTOR = replace(GUARANTOR, GUARANTOR == 2, 1))
#PRESENT_RESIDENT
data %<>%
mutate(PRESENT_RESIDENT = PRESENT_RESIDENT - 1)
This is specifically for the case of AGE. As previously said, we believe that the age of 125 is an error, hence we will discard it by keeping only the observations with a value lower than 76 (as 75 is the second highest value).
#AGE
data %<>%
filter(AGE < 76)
| Task | Output |
|---|---|
| This task includes constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes. | Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. Examples: area = length * width. Describe the creation of completely new records. Example: create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modeling purposes it might make sense to explicitly represent the fact that certain customers made zero purchases. |
As already mentioned in the previous chapter, we will create 4 different variables: 1. A binary variable describing the sex of the person (male vs. female). 2. A categorical variable for the purpose of the credit. 3. A categorical variable describing the property situation of the person (i.e. whether they own real estate). 4. A categorical variable describing the residence situation of the person (i.e. whether they own their residence, are renting or something else).
We will start with the variable describing the sex of the considered person: this variable will be created from the MALE_DIV, MALE_SINGLE and MALE_MAR_WID variables, and it will be a binary variable taking value 1 if the person is male, and value 0 if they are female.
More specifically, if either one of the variables used to construct the new one has value 1, so will the SEX_MALE variable, otherwise it will have value 0.
data %<>%
mutate(SEX_MALE = ifelse((MALE_DIV | MALE_SINGLE | MALE_MAR_or_WID) == 1, 1, 0)) %>%
mutate(SEX_MALE = as.factor(SEX_MALE))
We will now explore the new variable a bit, looking at the number of instances of each category and how it relates to the response variable.
#Representation of SEX_MALE per value
data %>%
ggplot(aes(SEX_MALE)) +
geom_bar(aes(fill = factor(SEX_MALE))) +
theme(legend.position = "none") +
geom_label(stat = 'count', aes(label =..count..))
#Representation of output variable in terms of SEX_MALE
data %>%
ggplot(aes(RESPONSE)) +
geom_bar(aes(fill = factor(SEX_MALE)), position = "dodge")+
labs(color = "", fill = "SEX_MALE", x = "RESPONSE", y = "count")
We can see from the first graph that we have more observations with a positive value for the SEX_MALE variable (690 vs. 309), meaning that there are more men than women in the dataset.
Moreover, thanks to the second graph, we can see a difference in the positive response rate between males and females, but this could also be due to the fact that there are more males than females in the data.
We will now move on to the other variables aforementioned, so that instead of having multiple dummy variables, we have factor variables with multiple levels.
Let’s start with the purpose of credit.
This variable will take the following values: 1 = the purpose of the credit was a new car; 2 = a used car; 3 = furniture; 4 = a radio or a television; 5 = increasing the level of education; 6 = retraining; 0 = something else.
It is created by assigning the respective value whenever the dummy corresponding to that purpose takes value 1; if none of them has value 1, PURPOSE takes value 0.
data %<>%
mutate(PURPOSE = ifelse(NEW_CAR == 1, 1,
ifelse(USED_CAR == 1, 2,
ifelse(FURNITURE == 1, 3,
ifelse(RADIO.TV == 1, 4,
ifelse(EDUCATION == 1, 5,
ifelse(RETRAINING == 1, 6, 0))))))) %>%
mutate(PURPOSE = as.factor(PURPOSE))
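As a side note, the same variable could be built somewhat more readably with dplyr::case_when(); the following sketch is equivalent to the nested ifelse() above:

#equivalent construction of PURPOSE with case_when (conditions are evaluated in order)
data %>%
  mutate(PURPOSE = case_when(
    NEW_CAR    == 1 ~ 1,
    USED_CAR   == 1 ~ 2,
    FURNITURE  == 1 ~ 3,
    RADIO.TV   == 1 ~ 4,
    EDUCATION  == 1 ~ 5,
    RETRAINING == 1 ~ 6,
    TRUE            ~ 0
  )) %>%
  mutate(PURPOSE = as.factor(PURPOSE))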
Let’s have a look at the new variable, in terms of number of observations per level and its link to the response variable.
data %>%
ggplot(aes(PURPOSE)) +
geom_bar(aes(reorder(PURPOSE, -table(PURPOSE)[PURPOSE]), fill = PURPOSE)) +
scale_fill_discrete(name = "PURPOSE",
labels = c("OTHER", "NEW_CAR", "USED_CAR", "FURNITURE",
"RADIO/TV", "EDUCATION", "RETRAINING")) +
geom_label(stat = 'count', aes(label =..count..))
data %>%
ggplot(aes(RESPONSE)) +
geom_bar(aes(fill = factor(PURPOSE)), position = "dodge") +
labs(x = "RESPONSE", y = "count") +
scale_fill_discrete(name = "PURPOSE",
labels = c("OTHER", "NEW_CAR", "USED_CAR", "FURNITURE",
"RADIO/TV", "EDUCATION", "RETRAINING")) +
theme_bw()
In the first graph, we can see that most of the observations fall in the purpose of getting a radio or a TV, followed by a new car and then furniture, while the least frequent is the education purpose.
In terms of the output variable, shown in the second graph, the largest differences can be found for Radio/TV and new car, but this could be driven by the fact that they are the purposes with the highest number of observations.
Now, we will create the property variable.
We will start by looking at whether the two variables that we want to use (namely, REAL_ESTATE and PROP_UNKN_NONE) are connected, and hence whether it makes sense to put them together.
In order to do so, as we previously mentioned, we will perform a chi-squared independence test.
chisq.test(data$REAL_ESTATE, data$PROP_UNKN_NONE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$REAL_ESTATE and data$PROP_UNKN_NONE
## X-squared = 69.97, df = 1, p-value < 2.2e-16
We can see that the two variables are statistically significantly associated, as the p-value is really low, almost equal to 0, and hence is lower than the considered significance level of alpha = 5%.
We can conclude that it makes sense to merge the two variables into one factor, which will take value 1 if the person has a real estate, value 2 if the person is not known to have a property and value 0 otherwise.
data %<>%
mutate(PROPERTY = as.factor(ifelse(REAL_ESTATE == 1, 1,
ifelse(PROP_UNKN_NONE == 1, 2, 0))))
Let’s have a look at this new variable too, once again in terms of the number of observations per level and whether there is a difference in occurrences given the output variable.
data %>%
  ggplot(aes(PROPERTY)) +
  geom_bar(aes(fill = PROPERTY)) +
  scale_fill_discrete(name = "PROPERTY",
                      labels = c("OTHER", "REAL_ESTATE", "PROP_UNKN_NONE")) +
  geom_label(stat = 'count', aes(label = ..count..))
data %>%
  ggplot(aes(RESPONSE)) +
  geom_bar(aes(fill = PROPERTY), position = "dodge") +
  scale_fill_discrete(name = "PROPERTY",
                      labels = c("OTHER", "REAL_ESTATE", "PROP_UNKN_NONE"))
We can clearly see in the first graph that the majority of the observations do not have a clear value for the property, being equal to 0 (563 compared to the 282 of REAL_ESTATE and 154 of PROP_UNKN_NONE).
If we consider the response (the second graph), we cannot really see a difference between applicants whose property is unknown and the others in terms of credit rejection, while those who own real estate appear more likely to be granted the credit than those who do not.
Now let’s look at the second variable that we would like to create, checking first whether it is needed.
This variable will be created using the RENT and OWN_RES variables to describe whether a person has a residence or not.
Let’s start once again by the chi-squared independence test.
chisq.test(data$RENT, data$OWN_RES)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$RENT and data$OWN_RES
## X-squared = 536.77, df = 1, p-value < 2.2e-16
Here too, we can conclude that the two variables are statistically significantly associated, as the p-value is very low, in particular lower than our chosen significance level of alpha = 5%.
Hence, we will create the residence variable, which will take value 1 if the person is renting, value 2 if the person owns their residence and value 0 otherwise.
data %<>%
mutate(RESIDENCE = as.factor(ifelse(RENT == 1, 1,
ifelse(OWN_RES == 1, 2, 0))))
And let’s explore the new variable a little, in terms of the number of observations per level and whether there is a difference in the possibility of getting a credit given this information.
data %>%
  ggplot(aes(RESIDENCE)) +
  geom_bar(aes(fill = RESIDENCE)) +
  scale_fill_discrete(name = "RESIDENCE",
                      labels = c("OTHER", "RENT", "OWN_RES")) +
  geom_label(stat = 'count', aes(label = ..count..))
data %>%
  ggplot(aes(RESPONSE)) +
  geom_bar(aes(fill = RESIDENCE), position = "dodge") +
  scale_fill_discrete(name = "RESIDENCE",
                      labels = c("OTHER", "RENT", "OWN_RES"))
In the first graph, we can see that the majority of the people in the sample own their residence (712 observations, compared to 108 for other and 179 who are renting).
Looking at the second graph, which compares it to the response variable, we can see that owning the residence seems to have an impact on the possibility of getting the credit, while renting does not seem to have a major impact.
| Task | Output |
|---|---|
| These are methods whereby information is combined from multiple tables or records to create new records or values. | Merging tables refers to joining together two or more tables that have different information about the same objects. Merged data also covers aggregations. Aggregation refers to operations where new values are computed by summarizing together information from multiple records and/or tables. |
Here we integrate the variables we created in the dataset and we discard the ones we used to create them, so that we avoid the problem of multicollinearity.
We will also need to drop one of the variables used to create the SEX_MALE variable, for the same reason. The choice falls on MALE_DIV.
We will also drop the identifier variable (OBS.) as it is not needed in the modelling part.
data_sel <- data %>%
dplyr::select(CHK_ACCT, DURATION, HISTORY, PURPOSE,
AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE,
SEX_MALE, MALE_SINGLE, MALE_MAR_or_WID,
CO.APPLICANT, GUARANTOR, PRESENT_RESIDENT,
PROPERTY, AGE, OTHER_INSTALL, RESIDENCE,
NUM_CREDITS, JOB, TELEPHONE, RESPONSE)
| Task | Output |
|---|---|
| Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table. | List the data to be included/excluded and the reasons for these decisions. |
To further select the data, we will use the correlation and run a simple linear model to see which variables are the most important to select.
We start with the correlation, using the basic dataset, because we cannot run a correlation on factor variables.
## V1
## PRESENT_RESIDENT -0.003059919
## NUM_DEPENDENTS 0.003296525
## MALE_MAR_or_WID 0.019844152
## FURNITURE -0.020669253
## JOB -0.033889427
## TELEPHONE 0.035704280
## RETRAINING -0.035923923
## NUM_CREDITS 0.046215841
## MALE_DIV -0.049924304
## GUARANTOR 0.055206089
## CO.APPLICANT -0.062607640
## EDUCATION -0.069954175
## INSTALL_RATE -0.073052339
## MALE_SINGLE 0.081465268
## AGE 0.089413005
## RENT -0.092509400
## NEW_CAR -0.098268291
## USED_CAR 0.100040026
## RADIO.TV 0.107374760
## OTHER_INSTALL -0.113009082
## EMPLOYMENT 0.117550263
## REAL_ESTATE 0.119759431
## PROP_UNKN_NONE -0.125508812
## OWN_RES 0.134228850
## AMOUNT -0.154366015
## SAV_ACCT 0.178079352
## DURATION -0.214326399
## HISTORY 0.229192869
## CHK_ACCT 0.352022485
We can see that, in general, the correlation between the output variable and the explanatory variables is not particularly high, with a maximum of 0.35 for CHK_ACCT and a minimum of -0.003 for PRESENT_RESIDENT.
We could select only the variables with a correlation higher than a certain absolute value; however, as the differences among the correlations are not large, and as we used the basic dataset rather than the one with the variables we have just created, we prefer not to make a selection here, and rather leave this decision to the modelling of a simple linear regression and a choice based on the AIC.
The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data. source: https://www.scribbr.com/statistics/akaike-information-criterion/
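Formally, for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the criterion is

\[AIC = 2k - 2\ln(\hat{L})\]

so a lower AIC indicates a better trade-off between goodness of fit and model complexity.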
The step function follows the idea that the variable whose removal decreases the AIC of the model the most will be discarded, up to the point at which it is not possible to decrease the AIC any further.
set.seed(2143)
lm.sel <- glm(RESPONSE ~., data = data_sel)
lm.sel <- step(lm.sel, trace = 0)
summary(lm.sel)
##
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + PURPOSE +
## AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_SINGLE +
## GUARANTOR + PROPERTY + OTHER_INSTALL + RESIDENCE + NUM_CREDITS +
## TELEPHONE, data = data_sel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.05164 -0.31768 0.08993 0.28791 0.83553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.944e-01 1.056e-01 6.576 7.88e-11 ***
## CHK_ACCT 9.306e-02 1.078e-02 8.632 < 2e-16 ***
## DURATION -5.070e-03 1.453e-03 -3.489 0.000506 ***
## HISTORY 6.796e-02 1.363e-02 4.984 7.35e-07 ***
## PURPOSE1 -1.213e-01 6.016e-02 -2.017 0.043987 *
## PURPOSE2 1.018e-01 6.854e-02 1.485 0.137931
## PURPOSE3 -1.316e-02 6.218e-02 -0.212 0.832417
## PURPOSE4 2.282e-03 5.959e-02 0.038 0.969465
## PURPOSE5 -1.558e-01 7.902e-02 -1.972 0.048920 *
## PURPOSE6 -1.599e-02 6.837e-02 -0.234 0.815149
## AMOUNT -1.719e-05 6.730e-06 -2.554 0.010801 *
## SAV_ACCT 3.303e-02 8.344e-03 3.959 8.08e-05 ***
## EMPLOYMENT 2.125e-02 1.106e-02 1.920 0.055087 .
## INSTALL_RATE -4.665e-02 1.279e-02 -3.648 0.000278 ***
## MALE_SINGLE 7.415e-02 2.757e-02 2.690 0.007267 **
## GUARANTOR 1.724e-01 5.859e-02 2.943 0.003330 **
## PROPERTY1 4.181e-02 3.058e-02 1.367 0.171816
## PROPERTY2 -9.679e-02 5.763e-02 -1.680 0.093348 .
## OTHER_INSTALL -8.740e-02 3.315e-02 -2.637 0.008506 **
## RESIDENCE1 -1.280e-01 6.941e-02 -1.844 0.065457 .
## RESIDENCE2 -5.565e-02 6.651e-02 -0.837 0.402923
## NUM_CREDITS -4.231e-02 2.489e-02 -1.700 0.089473 .
## TELEPHONE 4.960e-02 2.734e-02 1.814 0.069942 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1577717)
##
## Null deviance: 209.91 on 998 degrees of freedom
## Residual deviance: 153.99 on 976 degrees of freedom
## AIC: 1015
##
## Number of Fisher Scoring iterations: 2
data_sel <- lm.sel$model
Thanks to the AIC, we select the following variables: CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS and TELEPHONE, as they are the most significant. It is interesting to note that some levels of PURPOSE seem less relevant; more specifically, the only ones that are statistically significant are the first and the fifth. Moreover, this is coherent with the variables that had the highest correlations calculated before, hence we will use this method to make our final selection on the data.
| Task | Output |
|---|---|
| Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool. | Some tools have requirements on the order of the attributes, such as the first field being a unique identifier for each record or the last field being the outcome field the model is to predict. It might be important to change the order of the records in the dataset. Perhaps the modeling tool requires that the records be sorted according to the value of the outcome attribute. Additionally, there are purely syntactic changes made to satisfy the requirements of the specific modeling tool. |
We will change the dummies and the categorical variables to factors, so that they correspond to the description that was given to us.
data_sel %<>%
mutate(
CHK_ACCT = as.factor(CHK_ACCT),
HISTORY = as.factor(HISTORY),
SAV_ACCT = as.factor(SAV_ACCT),
EMPLOYMENT = as.factor(EMPLOYMENT),
MALE_SINGLE = as.factor(MALE_SINGLE),
GUARANTOR = as.factor(GUARANTOR),
OTHER_INSTALL = as.factor(OTHER_INSTALL),
TELEPHONE = as.factor(TELEPHONE),
RESPONSE = as.factor(RESPONSE)
)
str(data_sel)
## 'data.frame': 999 obs. of 16 variables:
## $ RESPONSE : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CHK_ACCT : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 4 4 2 4 2 ...
## $ DURATION : int 6 48 12 42 24 36 24 36 12 30 ...
## $ HISTORY : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ PURPOSE : Factor w/ 7 levels "0","1","2","3",..: 5 5 6 4 2 6 4 3 5 2 ...
## $ AMOUNT : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ SAV_ACCT : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ EMPLOYMENT : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ INSTALL_RATE : int 4 2 2 2 3 2 3 2 2 4 ...
## $ MALE_SINGLE : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 1 ...
## $ GUARANTOR : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ PROPERTY : Factor w/ 3 levels "0","1","2": 2 2 2 1 3 3 1 1 2 1 ...
## $ OTHER_INSTALL: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ RESIDENCE : Factor w/ 3 levels "0","1","2": 3 3 3 1 1 1 3 2 3 3 ...
## $ NUM_CREDITS : int 2 1 1 1 2 1 1 1 1 2 ...
## $ TELEPHONE : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
## - attr(*, "terms")=Classes 'terms', 'formula' language RESPONSE ~ CHK_ACCT + DURATION + HISTORY + PURPOSE + AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_SI| __truncated__ ...
## .. ..- attr(*, "variables")= language list(RESPONSE, CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE| __truncated__ ...
## .. ..- attr(*, "factors")= int [1:16, 1:15] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:16] "RESPONSE" "CHK_ACCT" "DURATION" "HISTORY" ...
## .. .. .. ..$ : chr [1:15] "CHK_ACCT" "DURATION" "HISTORY" "PURPOSE" ...
## .. ..- attr(*, "term.labels")= chr [1:15] "CHK_ACCT" "DURATION" "HISTORY" "PURPOSE" ...
## .. ..- attr(*, "order")= int [1:15] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(RESPONSE, CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE| __truncated__ ...
## .. ..- attr(*, "dataClasses")= Named chr [1:16] "numeric" "numeric" "numeric" "numeric" ...
## .. .. ..- attr(*, "names")= chr [1:16] "RESPONSE" "CHK_ACCT" "DURATION" "HISTORY" ...
The selected dataset hence has 999 observations of 16 variables, 15 of which are independent variables; 4 of these are continuous (DURATION, AMOUNT, INSTALL_RATE and NUM_CREDITS) and the remaining ones are categorical or dummy variables. The first variable is the output (RESPONSE), which is also a dummy.
We are now ready to move on with the modelling part of our analysis.
The modelling techniques that we will be using are the following:
| n° | Model | Definition |
|---|---|---|
| 1 | Logistic regression | > Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression). (https://en.wikipedia.org/wiki/Logistic_regression) |
| 2 | Decision trees | > A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements. (https://en.wikipedia.org/wiki/Decision_tree) |
| 3 | Discriminant analysis | > Discriminant analysis is a statistical technique used to classify observations into non-overlapping groups, based on scores on one or more quantitative predictor variables. (https://stattrek.com/multiple-regression/discriminant-analysis.aspx) |
| 4 | Random forest | > Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees. (https://en.wikipedia.org/wiki/Random_forest) |
| 5 | Neural network | > A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes. (https://en.wikipedia.org/wiki/Neural_network) |
| 6 | XGBoost | > XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. (https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/) |
In order to compare the 6 models shown above, we will mainly use the caret package for each algorithm.
\(H_{0}\): \(Model_n\) gives the best accuracy and sensitivity.
\(H_{1}\): It does not give the best values.
where \(n = 1, 2, \dots, 6\) represents each model listed in the technique table above.
To generate the models, we first need to standardize the data, as the variables have different scales. Nevertheless, we will normalize only the continuous variables, as the categorical and dummy variables have only a few levels.
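The exact normalization code is not shown here; a minimal sketch, assuming it is applied to the continuous variables identified above, could be:

#standardize only the continuous variables (mean 0, standard deviation 1)
data_sel %<>%
  mutate(across(c(DURATION, AMOUNT, INSTALL_RATE, NUM_CREDITS),
                ~ as.numeric(scale(.x))))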
Now that the normalization is done, let’s move on to the creation of the training and test sets. This is done by randomly dividing the data into two subsets, with 75% of the data in the training set and the remaining 25% in the test set.
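The splitting code itself is not shown; a minimal sketch with caret::createDataPartition (which samples within each class and hence preserves the class proportions), using the TrainData and TestData names that appear in the modelling code below, could be:

set.seed(1234)
idx <- caret::createDataPartition(data_sel$RESPONSE, p = 0.75, list = FALSE)
TrainData <- data_sel[idx, ]
TestData  <- data_sel[-idx, ]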
As you can see above, the classes keep the same proportions in the full dataset, the training set and the test set. In all of them the dependent variable is skewed, showing a greater tendency towards a positive response. For this reason we will evaluate two fits for each algorithm, one with the skewed data and the other with balanced data. Finally, to compare them we will compute the confusion matrix, which includes the following information:
In summary, the sensitivity measures the true positive rate, which is key for this project, since a false positive has a negative impact on our main objective: it would increase the risk of granting credits that will not be paid back. This means that, in addition to balancing the data, we will focus the second fit of each model on maximizing sensitivity.
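In terms of the confusion matrix counts (TP, TN, FP, FN), these metrics are defined as:

\[Sensitivity = \frac{TP}{TP+FN}, \qquad Specificity = \frac{TN}{TN+FP}, \qquad Accuracy = \frac{TP+TN}{TP+TN+FP+FN}\]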
The general equation for the model is:
\[ Z_{i} = ln(\frac{P_{i}} {1-P_{i}}) = \beta_0+\beta_1X_1+...+\beta_nX_n \]
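Inverting the link function gives the predicted probability:

\[P_{i} = \frac{1}{1+e^{-Z_{i}}}\]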
To apply the algorithm we will follow these steps:
| Data set | Steps |
|---|---|
| Unbalanced data | As we have seen earlier, the output variable is unbalanced. We are going to evaluate the accuracy and the sensitivity of the model, with the following steps: 1) Fit the model. 2) Coefficient analysis. 3) Predict. 4) Confusion matrix. |
| Balanced data | In this step, we are going to balance the data with the trainControl function and, then, we will evaluate the accuracy and the sensitivity of the model, with the following steps: 5) Fit the model. 6) Predict. 7) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5)
#10-Fold Cross Validation #5 repetitions
mod_lg_fit <- caret::train(RESPONSE ~ ., TrainData, method="glm",
family="binomial",trControl= train_params)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7192 -0.7199 0.3843 0.7077 2.3350
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.93510 0.81482 -1.148 0.25113
## CHK_ACCT1 0.25725 0.24875 1.034 0.30106
## CHK_ACCT2 1.15025 0.44137 2.606 0.00916 **
## CHK_ACCT3 1.65679 0.25994 6.374 1.84e-10 ***
## DURATION -0.33249 0.12474 -2.665 0.00769 **
## HISTORY1 -0.35513 0.60991 -0.582 0.56039
## HISTORY2 0.51189 0.48162 1.063 0.28784
## HISTORY3 0.71627 0.52256 1.371 0.17047
## HISTORY4 1.41024 0.49046 2.875 0.00404 **
## PURPOSE1 -0.92834 0.42913 -2.163 0.03052 *
## PURPOSE2 0.83586 0.54279 1.540 0.12358
## PURPOSE3 -0.08614 0.44419 -0.194 0.84624
## PURPOSE4 0.05502 0.43423 0.127 0.89917
## PURPOSE5 -0.73150 0.57782 -1.266 0.20553
## PURPOSE6 -0.08247 0.49176 -0.168 0.86682
## AMOUNT -0.32341 0.14085 -2.296 0.02167 *
## SAV_ACCT1 0.45888 0.33152 1.384 0.16630
## SAV_ACCT2 0.24327 0.45461 0.535 0.59257
## SAV_ACCT3 0.70072 0.55496 1.263 0.20671
## SAV_ACCT4 1.31133 0.31892 4.112 3.93e-05 ***
## EMPLOYMENT1 0.25345 0.41874 0.605 0.54499
## EMPLOYMENT2 0.63919 0.39563 1.616 0.10617
## EMPLOYMENT3 1.11825 0.43882 2.548 0.01082 *
## EMPLOYMENT4 0.72410 0.41270 1.755 0.07933 .
## INSTALL_RATE -0.35899 0.11182 -3.210 0.00133 **
## MALE_SINGLE1 0.54316 0.20976 2.590 0.00961 **
## GUARANTOR1 0.89175 0.48361 1.844 0.06519 .
## PROPERTY1 0.06645 0.23901 0.278 0.78099
## PROPERTY2 -0.53913 0.41424 -1.301 0.19309
## OTHER_INSTALL1 -0.55776 0.24600 -2.267 0.02337 *
## RESIDENCE1 -0.77956 0.50088 -1.556 0.11962
## RESIDENCE2 -0.27558 0.47140 -0.585 0.55883
## NUM_CREDITS -0.11808 0.12232 -0.965 0.33440
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 916.30 on 749 degrees of freedom
## Residual deviance: 682.31 on 717 degrees of freedom
## AIC: 748.31
##
## Number of Fisher Scoring iterations: 5
In this step, we can see that the variables with the greatest importance, being statistically significant in the model, are: the second and third levels of CHK_ACCT, DURATION, the fourth level of HISTORY, the first level of PURPOSE, AMOUNT, the fourth level of SAV_ACCT, the third level of EMPLOYMENT, INSTALL_RATE, MALE_SINGLE and the first level of OTHER_INSTALL.
| Variable | Significant (p < 0.05) |
|---|---|
| CHK_ACCT1 | FALSE |
| CHK_ACCT2 | TRUE |
| CHK_ACCT3 | TRUE |
| DURATION | TRUE |
| HISTORY1 | FALSE |
| HISTORY2 | FALSE |
| HISTORY3 | FALSE |
| HISTORY4 | TRUE |
| PURPOSE1 | TRUE |
| PURPOSE2 | FALSE |
| PURPOSE3 | FALSE |
| PURPOSE4 | FALSE |
| PURPOSE5 | FALSE |
| PURPOSE6 | FALSE |
| AMOUNT | TRUE |
| SAV_ACCT1 | FALSE |
| SAV_ACCT2 | FALSE |
| SAV_ACCT3 | FALSE |
| SAV_ACCT4 | TRUE |
| EMPLOYMENT1 | FALSE |
| EMPLOYMENT2 | FALSE |
| EMPLOYMENT3 | TRUE |
| EMPLOYMENT4 | FALSE |
| INSTALL_RATE | TRUE |
| MALE_SINGLE1 | TRUE |
| GUARANTOR1 | FALSE |
| PROPERTY1 | FALSE |
| PROPERTY2 | FALSE |
| OTHER_INSTALL1 | TRUE |
| RESIDENCE1 | FALSE |
| RESIDENCE2 | FALSE |
| NUM_CREDITS | FALSE |
If we look at the coefficients of the different variables, we can conclude that, among the significant ones described before, CHK_ACCT, HISTORY (all levels but the first), SAV_ACCT, EMPLOYMENT and MALE_SINGLE have a positive impact on the output, meaning that the higher their level, or if they are positive, the higher the probability of having RESPONSE = 1.
On the other hand, among the significant variables, DURATION, PURPOSE (all but level two and four), AMOUNT and OTHER_INSTALL have a negative effect on the output, meaning that if they increase their level or value, or if they have a positive value (for the dummies), the probability of having a positive response will decrease.
The linear predictor is given by \[ \eta = - 0.9 + 0.3 * CHKACCT_1 + 1.2 * CHKACCT_2 + 1.7 * CHKACCT_3 - 0.3 * DURATION - 0.4 * HISTORY_1 + 0.5 * HISTORY_2 + 0.7 * HISTORY_3 + 1.4 * HISTORY_4 - 0.9 * PURPOSE_1 + 0.8 * PURPOSE_2 - 0.08 * PURPOSE_3 + 0.05 * PURPOSE_4 - 0.7 * PURPOSE_5 - 0.08 * PURPOSE_6 - 0.3 * AMOUNT + 0.5 * SAVACCT_1 + 0.2 * SAVACCT_2 + 0.7 * SAVACCT_3 + 1.3 * SAVACCT_4 + 0.25 * EMPLOYMENT_1 + 0.6 * EMPLOYMENT_2 + 1.1 * EMPLOYMENT_3 + 0.7 * EMPLOYMENT_4 - 0.4 * INSTALLRATE + 0.5 * MALESINGLE_1 + 0.9 * GUARANTOR_1 + 0.06 * PROPERTY_1 - 0.5 * PROPERTY_2 - 0.6 * OTHERINSTALL_1 - 0.8 * RESIDENCE_1 - 0.3 * RESIDENCE_2 - 0.1 * NUMCREDITS \]
To be clear: if, for example, the PURPOSE variable takes value 3, only the coefficient of PURPOSE_3 is added to the others; the same goes for every other categorical variable. For the dummies, the coefficient is added only if the variable equals 1. For the continuous variables, the coefficient is multiplied by the value recorded in the observation.
Now we will get the predictions using this model. We start from the probabilities of the output implied by the coefficients found when fitting the model, then use a cut point of 0.5 to decide whether the predicted value is 1 (if the probability is higher than 0.5) or 0 (otherwise). The model basically plugs the information of the new observation into the function given above, obtaining a value eta that is turned into a predicted probability via p = 1 / (1 + exp(-eta)), which we use to determine the predicted class.
#prediction given the model
lg.pred <- predict(mod_lg_fit, newdata = TestData)
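If the underlying probabilities are needed instead of the hard class predictions, caret can return them and the cut point can be applied explicitly; a minimal sketch (assuming the column named "1" corresponds to the positive factor level):

#class probabilities and explicit 0.5 cut point
lg.prob  <- predict(mod_lg_fit, newdata = TestData, type = "prob")
lg.class <- factor(ifelse(lg.prob[, "1"] > 0.5, 1, 0), levels = c(0, 1))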
The imbalance towards positive predictions is more than clear in this graph.
The sensitivity is quite low, at 52%; the specificity, though, is high, at almost 90%, while the accuracy is 78%. The number of false positives is quite high, at 36.
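These figures come from the confusion matrix, which can be computed with caret as sketched below (note that, by default, caret takes the first factor level, here "0", as the positive class):

caret::confusionMatrix(lg.pred, TestData$RESPONSE)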
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_lg_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="glm",
family="binomial",
metric = "Sens", #optimize sensitivity
maximize = TRUE, #maximize the metric
trControl= train_params)
################check outputs################################
summary(mod_lg_fitbalance)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22739 -0.78261 -0.02752 0.79604 2.53360
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8770 1.0650 -0.823 0.410229
## CHK_ACCT1 0.3269 0.3181 1.028 0.304055
## CHK_ACCT2 1.3096 0.5156 2.540 0.011079 *
## CHK_ACCT3 1.9754 0.3241 6.096 1.09e-09 ***
## DURATION -0.2847 0.1556 -1.830 0.067224 .
## HISTORY1 -0.5277 0.7045 -0.749 0.453894
## HISTORY2 -0.3269 0.5422 -0.603 0.546492
## HISTORY3 -0.1523 0.5993 -0.254 0.799437
## HISTORY4 0.8379 0.5514 1.519 0.128658
## PURPOSE1 -1.4030 0.5170 -2.714 0.006656 **
## PURPOSE2 0.4736 0.6667 0.710 0.477406
## PURPOSE3 -0.6911 0.5304 -1.303 0.192610
## PURPOSE4 -0.6082 0.5162 -1.178 0.238638
## PURPOSE5 -0.7442 0.6780 -1.098 0.272320
## PURPOSE6 -0.7298 0.5839 -1.250 0.211399
## AMOUNT -0.2435 0.1755 -1.387 0.165326
## SAV_ACCT1 0.5952 0.4076 1.460 0.144271
## SAV_ACCT2 -0.2930 0.5523 -0.530 0.595804
## SAV_ACCT3 0.4723 0.6553 0.721 0.471134
## SAV_ACCT4 1.2719 0.3723 3.417 0.000634 ***
## EMPLOYMENT1 0.8801 0.5906 1.490 0.136204
## EMPLOYMENT2 1.2155 0.5581 2.178 0.029427 *
## EMPLOYMENT3 1.7196 0.6056 2.839 0.004519 **
## EMPLOYMENT4 1.0968 0.5823 1.883 0.059652 .
## INSTALL_RATE -0.3919 0.1387 -2.825 0.004730 **
## MALE_SINGLE1 0.7278 0.2648 2.749 0.005983 **
## GUARANTOR1 0.4654 0.6556 0.710 0.477814
## PROPERTY1 0.2035 0.2913 0.699 0.484762
## PROPERTY2 -1.0469 0.5950 -1.760 0.078487 .
## OTHER_INSTALL1 -0.7800 0.3200 -2.438 0.014775 *
## RESIDENCE1 -0.9846 0.6864 -1.435 0.151419
## RESIDENCE2 -0.7140 0.6671 -1.070 0.284474
## NUM_CREDITS -0.1793 0.1529 -1.173 0.240747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 623.83 on 449 degrees of freedom
## Residual deviance: 449.68 on 417 degrees of freedom
## AIC: 515.68
##
## Number of Fisher Scoring iterations: 5
#class prediction given the model
lg.pred.b <- predict(mod_lg_fitbalance, newdata = TestData)
Contrary to the output of the first model, we can see that the proportions of the predictions are better in the balanced case.
The sensitivity is 72%, the specificity is 64% and the accuracy only 66%. The number of false positives is 21.
The next image better illustrates how decision trees work.
Figure 4: Decision tree structure
The process consists in minimizing the classification error rate:
\[E = 1 - \max_{k}(\hat{p}_{mk})\] where \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)-th region that belong to the \(k\)-th class. For example, in a node where 80% of the training observations belong to one class, \(E = 1 - 0.8 = 0.2\).
To apply the algorithm we will follow these steps:
| Data set | Steps |
|---|---|
| Unbalanced data | As we have seen earlier, the output variable is unbalanced. We are going to evaluate the accuracy and the sensitivity of the model, with the following steps: 1) Fit the model. 2) Plot the best tree. 3) Predict. 4) Confusion matrix. |
| Balanced data | In this step, we are going to balance the data with the trainControl function and, then, we will evaluate the accuracy and the sensitivity of the model, with the following steps: 5) Fit the model. 6) Plot the best tree. 7) Predict. 8) Confusion matrix. |
We will start by fitting the model on the data.
#Same division
set.seed(1234)
#########################model######################################
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5)
#10-Fold Cross Validation #5 repetitions
mod_dt_fit <- caret::train(RESPONSE ~ ., TrainData, method="rpart",
trControl= train_params)
## var n wt dev yval
## <leaf> :8 Min. : 14.0 Min. : 14.0 Min. : 3.0 Min. :1.0
## AMOUNT :2 1st Qu.: 34.0 1st Qu.: 34.0 1st Qu.: 9.0 1st Qu.:1.0
## DURATION :2 Median :126.0 Median :126.0 Median : 42.0 Median :2.0
## CHK_ACCT3:1 Mean :185.1 Mean :185.1 Mean : 62.8 Mean :1.6
## PROPERTY2:1 3rd Qu.:252.0 3rd Qu.:252.0 3rd Qu.: 84.0 3rd Qu.:2.0
## SAV_ACCT4:1 Max. :750.0 Max. :750.0 Max. :225.0 Max. :2.0
## (Other) :0
## complexity ncompete nsurrogate
## Min. :0.000000 Min. :0.000 Min. :0.0
## 1st Qu.:0.002222 1st Qu.:0.000 1st Qu.:0.0
## Median :0.019259 Median :0.000 Median :0.0
## Mean :0.015605 Mean :1.867 Mean :0.8
## 3rd Qu.:0.026667 3rd Qu.:4.000 3rd Qu.:0.5
## Max. :0.026667 Max. :4.000 Max. :5.0
##
## yval2.V1 yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
## Min. :1.0 Min. : 3.00000 Min. : 4.0 Min. :0.1174497 Min. :0.1428571 Min. :0.0186667
## 1st Qu.:1.0 1st Qu.: 26.50000 1st Qu.: 15.5 1st Qu.:0.3346560 1st Qu.:0.3971861 1st Qu.:0.0453333
## Median :2.0 Median : 42.00000 Median : 60.0 Median :0.4285714 Median :0.5714286 Median :0.1680000
## Mean :1.6 Mean : 67.13333 Mean :118.0 Mean :0.4620491 Mean :0.5379509 Mean :0.2468444
## 3rd Qu.:2.0 3rd Qu.: 84.50000 3rd Qu.:167.5 3rd Qu.:0.6028139 3rd Qu.:0.6653440 3rd Qu.:0.3360000
## Max. :2.0 Max. :225.00000 Max. :525.0 Max. :0.8571429 Max. :0.8825503 Max. :1.0000000
##
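The steps listed above include plotting the best tree; this can be done, for instance, with the rpart.plot package (an assumption, as the plotting code is not shown in this report) applied to caret's final model:

#visualize the final pruned tree selected by caret
rpart.plot::rpart.plot(mod_dt_fit$finalModel)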
#prediction given the model
dt.pred <- predict(mod_dt_fit, newdata = TestData) #predict returns the predicted class for the binomial response
The prediction is clearly biased to a positive answer.
We can see that here the sensitivity is really low, while the specificity is higher, reaching a value above 92%; recall, however, that sensitivity is the metric we are most interested in. The accuracy is around 74%. In 52 cases in which the model should have given a negative value, it predicted a positive one, which could cost the company quite a lot.
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_dt_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="rpart",
metric = "Sens", #optimize sensitivity
maximize = TRUE, #maximize the metric
trControl= train_params)
## var n wt dev yval
## <leaf> :6 Min. : 8.0 Min. : 8.0 Min. : 1.00 Min. :1.000
## AMOUNT :3 1st Qu.: 52.0 1st Qu.: 52.0 1st Qu.: 12.00 1st Qu.:1.000
## CHK_ACCT3:1 Median :174.0 Median :174.0 Median : 58.00 Median :1.000
## SAV_ACCT4:1 Mean :166.8 Mean :166.8 Mean : 61.55 Mean :1.364
## CHK_ACCT1:0 3rd Qu.:227.5 3rd Qu.:227.5 3rd Qu.: 80.00 3rd Qu.:2.000
## CHK_ACCT2:0 Max. :450.0 Max. :450.0 Max. :225.00 Max. :2.000
## (Other) :0
## complexity ncompete nsurrogate
## Min. :0.000000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.005556 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.013333 Median :0.000 Median :0.0000
## Mean :0.045657 Mean :1.818 Mean :0.5455
## 3rd Qu.:0.022222 3rd Qu.:4.000 3rd Qu.:0.5000
## Max. :0.364444 Max. :4.000 Max. :3.0000
##
## yval2.V1 yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
## Min. :1.0000000 Min. : 1.00000 Min. : 6.00000 Min. :0.0833333 Min. :0.1492537 Min. :0.0177778
## 1st Qu.:1.0000000 1st Qu.: 24.50000 1st Qu.: 17.00000 1st Qu.:0.3141892 1st Qu.:0.3424908 1st Qu.:0.1155556
## Median :1.0000000 Median :116.00000 Median : 64.00000 Median :0.6134021 Median :0.3865979 Median :0.3866667
## Mean :1.3636364 Mean : 95.72727 Mean : 71.09091 Mean :0.5030050 Mean :0.4969950 Mean :0.3707071
## 3rd Qu.:2.0000000 3rd Qu.:147.50000 3rd Qu.: 96.50000 3rd Qu.:0.6575092 3rd Qu.:0.6858108 3rd Qu.:0.5055556
## Max. :2.0000000 Max. :225.00000 Max. :225.00000 Max. :0.8507463 Max. :0.9166667 Max. :1.0000000
##
We can see that it is a bit more balanced, even if the number of positive predictions is still higher.
We can see that here the sensitivity, the metric in which we are most interested, has improved; however, the specificity is lower, at above 63%. The accuracy is around 60%.
There are four types of discriminant analysis, which we explain in the following table:
| n° | Model | Definition |
|---|---|---|
| 1 | LDA | > Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. (https://en.wikipedia.org/wiki/Linear_discriminant_analysis) |
| 2 | QDA | > This method assumes that the measurements from each class are normally distributed, but there is no assumption that the covariance of each of the classes is identical. When the normality assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test. (https://en.wikipedia.org/wiki/Quadratic_classifier) |
| 3 | FDA | > It analyzes data providing information about curves, surfaces or anything else varying over a continuum. In its most general form, under an FDA framework each sample element is considered to be a function. (https://en.wikipedia.org/wiki/Functional_data_analysis) |
| 4 | MDA | > Here MDA refers to mixture discriminant analysis (caret's method "mda", as the model summary below confirms), which models each class as a mixture of Gaussian subclasses rather than a single Gaussian, allowing more flexible class boundaries. |
The next image better illustrates how each model works.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Predict. 3) Confusion matrix. |
| Balanced data | 4) Fit the model. 5) Predict. 6) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5)
#K-Fold Cross Validation
mod_lda_fit <- caret::train(RESPONSE ~ ., TrainData, method="lda",
family="binomial", # note: family is a glm argument and is silently ignored by lda
trControl= train_params)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 64 -none- numeric
## scaling 32 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 4 -none- call
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
The linear combination of predictor variables used to form the decision rule (the first linear discriminant) is the following:
\[ RESPONSE = -0.3265 * DURATION -0.2747 * HISTORY_1 + 0.8792 * HISTORY_2 + 1.1810 * HISTORY_3 + 1.6214 * HISTORY_4 - 0.7437 * PURPOSE_1 + 0.7736 * PURPOSE_2 + 0.0172 * PURPOSE_3 + 0.2035 * PURPOSE_4 - 0.6116 * PURPOSE_5 + 0.1298 * PURPOSE_6 - 0.2579 * AMOUNT + 0.5066 * SAV_ACCT_1 + 0.7517 * SAV_ACCT_2 + 0.7778 * SAV_ACCT_3 + 1.0997 * SAV_ACCT_4 + 0.6175 * EMPLOYMENT_1 + 1.1982 * EMPLOYMENT_2 + 1.4580 * EMPLOYMENT_3 + 1.1806 * EMPLOYMENT_4 - 0.2897 * INSTALL_RATE - 0.5663 * SEX_MALE_1 + 0.9272 * MALE_SINGLE_1 + 0.5183 * MALE_MAR_WID_1 - 0.0863 * CO_APPLICANT_1 + 0.5084 * GUARANTOR_1 - 0.2437 * PRESENT_RESIDENT_-1 - 0.1913 * PRESENT_RESIDENT_0 - 0.0477 * PRESENT_RESIDENT_1 + 0.0367 * PROPERTY_1 - 0.6374 * PROPERTY_2 + 0.0502 * AGE - 0.4501 * OTHER_INSTALL_1 - 0.7509 * RESIDENCE_1 - 0.2185 * RESIDENCE_2 - 0.0703 * NUM_CREDITS - 1.0570 * JOB_1 - 1.0581 * JOB_2 - 0.8362 * JOB_3\]
Each new observation is scored by plugging its values into this formula. It follows the same principle described for the generalized linear model.
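The coefficients reported in the formula come from the scaling matrix of the underlying lda object; a minimal sketch of how to extract them:
round(mod_lda_fit$finalModel$scaling, 4) # coefficients of the first linear discriminant (LD1)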
lda.pred <- predict(mod_lda_fit, newdata = TestData) # class predictions on the test set
This graph confirms what was explained in the fitting part: the predictions tend to give a positive response.
Here the sensitivity (52%) is higher than for the previous unbalanced models, but it is still quite low. The accuracy, at around 79%, is actually the highest so far. What is important to note is that in 36 cases the model predicted a positive value when the output should have been negative, which could cost the company a lot.
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_lda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="lda",
family="binomial",
metric = "Sens", #optimize sensitivity
maximize = TRUE, #maximize the metric
trControl= train_params)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 64 -none- numeric
## scaling 32 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 4 -none- call
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
lda.pred.b <- predict(mod_lda_fitbalance, newdata = TestData) # class predictions on the test set
We can see that the situation is more balanced.
The sensitivity is 73%, the specificity is around 70% and the accuracy is around 71%. The number of false positives, though, is only 20.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Predict. 3) Confusion matrix. |
| Balanced data | 4) Fit the model. 5) Predict. 6) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
mod_qda_fit <- caret::train(RESPONSE ~ ., TrainData, method="qda",
family="binomial",trControl= train_params)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 64 -none- numeric
## scaling 2048 -none- numeric
## ldet 2 -none- numeric
## lev 2 -none- character
## N 1 -none- numeric
## call 4 -none- call
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
It seems that the majority of the false predictions still fall on the positive level, though there are fewer than before.
The sensitivity is almost 55%, the specificity is high, reaching almost 84%, and the accuracy is 75%. The number of false positives, however, is 34.
As we predicted, the model performs a little worse than the LDA overall, except for the sensitivity, which is the highest so far, though still quite low. Since we want the number of false positives to be low, the 34 observed here is still too high.
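Neither the prediction step for the unbalanced QDA nor the fit of the balanced QDA is shown above; a minimal sketch following the same pattern as the other models would be:
qda.pred <- predict(mod_qda_fit, newdata = TestData) # unbalanced predictions discussed above
set.seed(1234)
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_qda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="qda",
metric = "Sens", #optimize sensitivity
maximize = TRUE,
trControl= train_params)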
qda.pred.b <- predict(mod_qda_fitbalance, newdata = TestData)
We can see that there is still some unbalance towards the positive value.
The sensitivity is 65%, the specificity is 71% and the accuracy almost 70%, while the false positives are 26.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Predict. 3) Confusion matrix. |
| Balanced data | 4) Fit the model. 5) Predict. 6) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
library(earth)
mod_fda_fit <- caret::train(RESPONSE ~ ., TrainData, method="fda",
trControl= train_params)
## Flexible Discriminant Analysis
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ...
## Resampling results across tuning parameters:
##
## nprune Accuracy Kappa
## 2 0.7000199 0.0000000
## 13 0.7349707 0.3090363
## 25 0.7477476 0.3574288
##
## Tuning parameter 'degree' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1 and nprune = 25.
fda.pred <- predict(mod_fda_fit, newdata = TestData)
The imbalance towards the positive value is very clear in this graph.
This model has a sensitivity of almost 55%, among the highest so far, and a specificity of about 85%. The accuracy is around 76%. There are 34 false positive observations.
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_fda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="fda",
metric = "Sens", #optimize sensitivity
maximize = TRUE,
trControl= train_params)
## Length Class Mode
## percent.explained 1 -none- numeric
## values 1 -none- numeric
## means 2 -none- numeric
## theta.mod 1 -none- numeric
## dimension 1 -none- numeric
## prior 2 table numeric
## fit 29 earth list
## call 7 -none- call
## terms 3 terms call
## confusion 4 table numeric
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Using the model we get the predictions for the RESPONSE variable, and we can then construct the confusion matrix for this case.
fda.pred.b <- predict(mod_fda_fitbalance, newdata = TestData) # class predictions on the test set
We can see that the situation is more balanced.
Here, the sensitivity is almost 79%, while the specificity is 66%, with an accuracy of almost 70%. The false positives are really low, at only 16 observations.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Predict. 3) Confusion matrix. |
| Balanced data | 4) Fit the model. 5) Predict. 6) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
mod_mda_fit <- caret::train(RESPONSE ~ ., TrainData, method="mda",
family="binomial",trControl= train_params)
## Length Class Mode
## percent.explained 3 -none- numeric
## values 3 -none- numeric
## means 12 -none- numeric
## theta.mod 9 -none- numeric
## dimension 1 -none- numeric
## sub.prior 2 -none- list
## fit 5 polyreg list
## call 5 -none- call
## weights 2 -none- list
## prior 2 table numeric
## assign.theta 2 -none- list
## deviance 1 -none- numeric
## confusion 4 table numeric
## terms 3 terms call
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
mda.pred <- predict(mod_mda_fit, newdata = TestData)
The imbalance towards the positive value is again very clear in this graph.
The sensitivity in this case is around 55%, while the specificity is higher, reaching 85%. The accuracy is around 75% and there are 34 false positives.
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_mda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="mda",
family="binomial",
metric = "Sens", #optimize sensitivity
maximize = TRUE,
trControl= train_params)
## Mixture Discriminant Analysis
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 675, 676, 675, 674, 676, 674, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## subclasses Accuracy Kappa
## 2 0.6984336 0.3549508
## 3 0.6993789 0.3545579
## 4 0.6829469 0.3139389
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was subclasses = 3.
mda.pred.b <- predict(mod_mda_fitbalance, newdata = TestData) # class predictions on the test set
There is still some imbalance toward the positive value.
Here the sensitivity is 72%, the specificity 74% and the accuracy 73%. The false positives are decreasing, with a value of 21.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Check the variables. 3) Predict. 4) Confusion matrix. |
| Balanced data | 5) Fit the model. 6) Predict. 7) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
mod_rf_fit <- caret::train(RESPONSE ~ ., TrainData, method="rf",
trControl= train_params)
## Random Forest
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7235177 0.1525749
## 17 0.7477162 0.3489012
## 32 0.7445090 0.3450546
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.
The summary of the model gives the decrease in accuracy and the decrease in Gini index for each variable in the model, along with the number of trees that are built (500 in our case) and the number of variables randomly chosen to be tried at each split before determining which one best describes the node. Moreover, we can already find the confusion matrix (we will show it again afterwards to keep the analysis coherent throughout all the models), with the class errors and the out-of-bag estimate of the error rate.
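This summary can be printed from the underlying randomForest object stored by caret; a minimal sketch:
mod_rf_fit$finalModel # prints ntree, mtry, the OOB error estimate and the confusion matrix with class errors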
Let’s give some definitions to be clearer:
Variable importance is the mean decrease of accuracy over all out-of-bag cross-validated predictions, when a given variable is permuted after training but before prediction.
Gini importance measures the average gain of purity by splits of a given variable. If the variable is useful, it tends to split mixed-label nodes into pure single-class nodes. Splitting on a permuted variable tends neither to increase nor decrease node purities.
Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging) to sub-sample data samples used for training. OOB is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. source: https://en.wikipedia.org/wiki/Out-of-bag_error
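The importance table below was presumably produced by a call of this kind (it requires the forest to have been grown with importance = TRUE):
randomForest::importance(mod_rf_fit$finalModel) # per-class importances, mean decrease in accuracy and in Gini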
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CHK_ACCT 30.8377522 14.9310529 29.62471840 24.909137
## DURATION 3.0789831 17.2512610 16.97809183 23.722782
## HISTORY 10.6308613 10.2319846 14.95907324 16.225153
## PURPOSE 4.6632195 3.8556094 5.99145514 21.218288
## AMOUNT 2.7620518 12.2473068 13.09750874 37.191057
## SAV_ACCT 9.7424546 2.4221673 7.48429134 12.897821
## EMPLOYMENT 4.7202650 3.2030500 5.62544435 15.860447
## INSTALL_RATE -1.5100678 2.6360360 1.30398030 10.326397
## MALE_SINGLE 3.2165679 0.1469817 2.12884373 5.216763
## GUARANTOR 2.6848191 9.3310231 9.12005837 2.063821
## PROPERTY 0.8703037 4.2813758 4.00866662 7.959729
## OTHER_INSTALL 1.6732583 3.4119582 3.71946567 4.919703
## RESIDENCE -0.4792045 0.2264302 -0.09300405 6.525084
## NUM_CREDITS -1.0737951 4.8444649 3.40369292 5.752978
The most important variables appear to be CHK_ACCT, DURATION and HISTORY in terms of accuracy, and AMOUNT, CHK_ACCT and DURATION in terms of Gini index, which is consistent with what we have found up to now.
The predictions show a clear preference towards the positive value.
Here the sensitivity is around 47%, the specificity is high (more than 87%) and the accuracy is around 75%, while the number of false positives is 40.
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_rf_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="rf",
family="binomial",
metric = "Sens", #optimize sensitivity
maximize = TRUE,
trControl= train_params)
## Random Forest
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 674, 676, 675, 675, 675, 675, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6779684 0.3442597
## 17 0.6877403 0.3509316
## 32 0.6842658 0.3449484
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.
rf.pred.b <- predict(mod_rf_fitbalance, newdata = TestData)
We can see that the predictions are more balanced.
Here the sensitivity is almost 79%, the specificity is 65% and the accuracy almost 70%. However, the false positives are really low, at only 16 cases.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Plot. 3) Predict. 4) Confusion matrix. |
| Balanced data | 5) Fit the model. 6) Plot. 7) Predict. 8) Confusion matrix. |
#Same division
set.seed(1234)
#########################model######################################
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5)
mod_nn_fit <- caret::train(RESPONSE ~ ., TrainData, method="nnet",
trControl= train_params)
## Neural Network
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.7050009 0.3312729
## 1 1e-04 0.7093730 0.3370020
## 1 1e-01 0.7509943 0.3830480
## 3 0e+00 0.7003229 0.2899405
## 3 1e-04 0.7104328 0.2919372
## 3 1e-01 0.7254274 0.3296829
## 5 0e+00 0.6965142 0.2769069
## 5 1e-04 0.7045505 0.2925045
## 5 1e-01 0.7069400 0.2912240
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
nn.pred <- predict(mod_nn_fit, newdata = TestData)
We can see an unbalanced result toward the positive value.
Here, the sensitivity is 53%, but the specificity is almost 88%, with an accuracy of 77%. The false positives, however, are still 35.
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_nn_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="nnet",
family="binomial",
metric = "Sens", #optimize sensitivity
maximize = TRUE,
trControl= train_params)
## Neural Network
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 675, 675, 675, 675, 676, 675, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.6668277 0.2984470
## 1 1e-04 0.6847165 0.3275770
## 1 1e-01 0.6918545 0.3515522
## 3 0e+00 0.6758101 0.2945431
## 3 1e-04 0.6611416 0.2806308
## 3 1e-01 0.6741378 0.3118398
## 5 0e+00 0.6700138 0.2968991
## 5 1e-04 0.6525221 0.2713653
## 5 1e-01 0.6648893 0.3005763
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
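The corresponding prediction step is not shown above; following the pattern of the other models it would be:
nn.pred.b <- predict(mod_nn_fitbalance, newdata = TestData) # class predictions on the test set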
Here we see a sensitivity of 77%, a rather low specificity (61%, among the lowest we have found) and an accuracy of only 66%. However, the false positives are quite low, at only 17.
Steps for the application of the algorithm:
| Data set | Steps |
|---|---|
| Unbalanced data | 1) Fit the model. 2) Plot. 3) Predict. 4) Confusion matrix. |
| Balanced data | 5) Fit the model. 6) Predict. 7) Confusion matrix. |
######################### transform data ############
data_xgboost <- purrr::map_df(data_scale, function(column) {
column %>%
as.factor() %>%
as.numeric() %>%
{ . - 1 } }) # recode every variable as numeric values starting at 0
test_xgboost <- dplyr::sample_frac(data_xgboost, size = 0.249)
train_xgboost <- dplyr::setdiff(data_xgboost, test_xgboost)
# Convert to DMatrix
train_xgb_matrix <- train_xgboost %>%
dplyr::select(- RESPONSE) %>%
as.matrix() %>%
xgboost::xgb.DMatrix(data = ., label = train_xgboost$RESPONSE)
# Convert to DMatrix
test_xgb_matrix <- test_xgboost %>%
dplyr::select(- RESPONSE) %>%
as.matrix() %>%
xgboost::xgb.DMatrix(data = ., label = test_xgboost$RESPONSE)
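Note that the DMatrix objects above are not used by the caret fit that follows, which works on TrainData directly; they would serve the native xgboost interface. A minimal sketch of that route (mod_xgb_native and native.pred are our illustrative names, with parameter values mirroring the tuning result further below):
params <- list(objective = "binary:logistic", eta = 0.4, max_depth = 2)
mod_xgb_native <- xgboost::xgb.train(params = params, data = train_xgb_matrix, nrounds = 100)
native.pred <- predict(mod_xgb_native, test_xgb_matrix) # predicted probabilities of class 1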
#Same division
set.seed(1234)
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv",
number = 10, # with n folds
repeats=5) #K-Fold Cross Validation
mod_xgb_fit <- caret::train(RESPONSE ~ ., TrainData,
method="xgbTree",
trControl= train_params)
## eXtreme Gradient Boosting
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy Kappa
## 0.3 1 0.6 0.50 50 0.7418943 0.3130889
## 0.3 1 0.6 0.50 100 0.7510080 0.3596714
## 0.3 1 0.6 0.50 150 0.7485370 0.3576183
## 0.3 1 0.6 0.75 50 0.7448313 0.3088400
## 0.3 1 0.6 0.75 100 0.7555702 0.3640103
## 0.3 1 0.6 0.75 150 0.7596064 0.3826587
## 0.3 1 0.6 1.00 50 0.7320230 0.2517796
## 0.3 1 0.6 1.00 100 0.7472031 0.3250098
## 0.3 1 0.6 1.00 150 0.7541726 0.3577772
## 0.3 1 0.8 0.50 50 0.7491559 0.3330483
## 0.3 1 0.8 0.50 100 0.7499660 0.3528052
## 0.3 1 0.8 0.50 150 0.7560643 0.3778748
## 0.3 1 0.8 0.75 50 0.7408132 0.2968492
## 0.3 1 0.8 0.75 100 0.7528323 0.3628258
## 0.3 1 0.8 0.75 150 0.7544858 0.3723502
## 0.3 1 0.8 1.00 50 0.7272474 0.2373587
## 0.3 1 0.8 1.00 100 0.7482700 0.3303301
## 0.3 1 0.8 1.00 150 0.7528463 0.3540888
## 0.3 2 0.6 0.50 50 0.7488286 0.3516789
## 0.3 2 0.6 0.50 100 0.7522569 0.3781276
## 0.3 2 0.6 0.50 150 0.7525873 0.3773257
## 0.3 2 0.6 0.75 50 0.7563457 0.3724741
## 0.3 2 0.6 0.75 100 0.7560932 0.3862271
## 0.3 2 0.6 0.75 150 0.7558121 0.3935367
## 0.3 2 0.6 1.00 50 0.7507128 0.3463173
## 0.3 2 0.6 1.00 100 0.7590378 0.3887471
## 0.3 2 0.6 1.00 150 0.7593079 0.3971011
## 0.3 2 0.8 0.50 50 0.7539065 0.3696032
## 0.3 2 0.8 0.50 100 0.7509478 0.3686413
## 0.3 2 0.8 0.50 150 0.7469724 0.3692421
## 0.3 2 0.8 0.75 50 0.7488107 0.3507391
## 0.3 2 0.8 0.75 100 0.7531630 0.3804305
## 0.3 2 0.8 0.75 150 0.7531277 0.3839775
## 0.3 2 0.8 1.00 50 0.7536859 0.3597987
## 0.3 2 0.8 1.00 100 0.7579463 0.3883921
## 0.3 2 0.8 1.00 150 0.7566413 0.3884018
## 0.3 3 0.6 0.50 50 0.7445868 0.3541288
## 0.3 3 0.6 0.50 100 0.7398466 0.3546145
## 0.3 3 0.6 0.50 150 0.7304378 0.3286955
## 0.3 3 0.6 0.75 50 0.7483097 0.3643967
## 0.3 3 0.6 0.75 100 0.7441073 0.3627462
## 0.3 3 0.6 0.75 150 0.7435769 0.3611827
## 0.3 3 0.6 1.00 50 0.7464711 0.3509138
## 0.3 3 0.6 1.00 100 0.7493550 0.3673045
## 0.3 3 0.6 1.00 150 0.7490631 0.3703037
## 0.3 3 0.8 0.50 50 0.7419980 0.3507760
## 0.3 3 0.8 0.50 100 0.7421971 0.3551587
## 0.3 3 0.8 0.50 150 0.7355441 0.3447870
## 0.3 3 0.8 0.75 50 0.7482851 0.3599911
## 0.3 3 0.8 0.75 100 0.7429689 0.3586103
## 0.3 3 0.8 0.75 150 0.7386764 0.3455336
## 0.3 3 0.8 1.00 50 0.7470363 0.3559097
## 0.3 3 0.8 1.00 100 0.7499492 0.3705114
## 0.3 3 0.8 1.00 150 0.7445618 0.3602553
## 0.4 1 0.6 0.50 50 0.7504645 0.3545303
## 0.4 1 0.6 0.50 100 0.7542409 0.3709721
## 0.4 1 0.6 0.50 150 0.7481281 0.3627715
## 0.4 1 0.6 0.75 50 0.7501758 0.3327702
## 0.4 1 0.6 0.75 100 0.7518508 0.3604920
## 0.4 1 0.6 0.75 150 0.7579531 0.3831873
## 0.4 1 0.6 1.00 50 0.7410695 0.2950260
## 0.4 1 0.6 1.00 100 0.7523093 0.3525505
## 0.4 1 0.6 1.00 150 0.7528890 0.3598949
## 0.4 1 0.8 0.50 50 0.7501871 0.3505324
## 0.4 1 0.8 0.50 100 0.7520644 0.3668550
## 0.4 1 0.8 0.50 150 0.7553003 0.3747199
## 0.4 1 0.8 0.75 50 0.7502080 0.3455585
## 0.4 1 0.8 0.75 100 0.7534442 0.3653316
## 0.4 1 0.8 0.75 150 0.7561533 0.3788979
## 0.4 1 0.8 1.00 50 0.7426767 0.2984313
## 0.4 1 0.8 1.00 100 0.7504569 0.3466898
## 0.4 1 0.8 1.00 150 0.7536928 0.3621781
## 0.4 2 0.6 0.50 50 0.7523672 0.3749669
## 0.4 2 0.6 0.50 100 0.7520434 0.3820442
## 0.4 2 0.6 0.50 150 0.7477865 0.3706605
## 0.4 2 0.6 0.75 50 0.7552045 0.3748880
## 0.4 2 0.6 0.75 100 0.7544863 0.3892710
## 0.4 2 0.6 0.75 150 0.7528725 0.3849105
## 0.4 2 0.6 1.00 50 0.7563633 0.3736177
## 0.4 2 0.6 1.00 100 0.7603679 0.3941108
## 0.4 2 0.6 1.00 150 0.7544439 0.3840109
## 0.4 2 0.8 0.50 50 0.7501771 0.3682866
## 0.4 2 0.8 0.50 100 0.7448568 0.3668428
## 0.4 2 0.8 0.50 150 0.7442241 0.3681954
## 0.4 2 0.8 0.75 50 0.7547566 0.3807891
## 0.4 2 0.8 0.75 100 0.7539527 0.3855215
## 0.4 2 0.8 0.75 150 0.7480716 0.3766680
## 0.4 2 0.8 1.00 50 0.7520894 0.3618823
## 0.4 2 0.8 1.00 100 0.7507953 0.3722940
## 0.4 2 0.8 1.00 150 0.7520862 0.3768357
## 0.4 3 0.6 0.50 50 0.7410843 0.3569199
## 0.4 3 0.6 0.50 100 0.7312583 0.3342736
## 0.4 3 0.6 0.50 150 0.7286133 0.3343768
## 0.4 3 0.6 0.75 50 0.7403550 0.3484262
## 0.4 3 0.6 0.75 100 0.7390144 0.3507854
## 0.4 3 0.6 0.75 150 0.7403478 0.3531362
## 0.4 3 0.6 1.00 50 0.7493269 0.3705657
## 0.4 3 0.6 1.00 100 0.7456078 0.3664372
## 0.4 3 0.6 1.00 150 0.7383996 0.3517229
## 0.4 3 0.8 0.50 50 0.7415005 0.3566296
## 0.4 3 0.8 0.50 100 0.7353452 0.3471341
## 0.4 3 0.8 0.50 150 0.7305121 0.3298686
## 0.4 3 0.8 0.75 50 0.7472678 0.3672087
## 0.4 3 0.8 0.75 100 0.7414042 0.3543595
## 0.4 3 0.8 0.75 150 0.7365968 0.3419264
## 0.4 3 0.8 1.00 50 0.7488359 0.3643174
## 0.4 3 0.8 1.00 100 0.7471971 0.3686439
## 0.4 3 0.8 1.00 150 0.7423783 0.3605316
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 2, eta
## = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 1.
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 80 100 2 0.4 0 0.6 1 1
xgb.pred <- predict(mod_xgb_fit, newdata = TestData)
We can see that positive values are predicted with a markedly higher frequency.
Here the sensitivity is quite low, at 53%; the specificity, though, is at 87% and the accuracy at 77%. The number of false positives is high, at 35.
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
repeats=5, sampling = "down")
mod_xgb_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="xgbTree",
metric = "Sens", #optimize sensitivity
maximize = TRUE,
trControl= train_params)
## eXtreme Gradient Boosting
##
## 750 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 676, 674, 675, 675, 674, 676, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy Kappa
## 0.3 1 0.6 0.50 50 0.6989449 0.3682164
## 0.3 1 0.6 0.50 100 0.7037523 0.3721827
## 0.3 1 0.6 0.50 150 0.7004950 0.3646692
## 0.3 1 0.6 0.75 50 0.7029369 0.3805843
## 0.3 1 0.6 0.75 100 0.7165677 0.3981571
## 0.3 1 0.6 0.75 150 0.7106612 0.3888852
## 0.3 1 0.6 1.00 50 0.6840729 0.3469812
## 0.3 1 0.6 1.00 100 0.7053843 0.3809739
## 0.3 1 0.6 1.00 150 0.7069175 0.3800252
## 0.3 1 0.8 0.50 50 0.7037528 0.3783023
## 0.3 1 0.8 0.50 100 0.7028241 0.3746458
## 0.3 1 0.8 0.50 150 0.7132782 0.3853881
## 0.3 1 0.8 0.75 50 0.6967969 0.3702232
## 0.3 1 0.8 0.75 100 0.7098792 0.3887799
## 0.3 1 0.8 0.75 150 0.7078024 0.3804073
## 0.3 1 0.8 1.00 50 0.6882519 0.3537514
## 0.3 1 0.8 1.00 100 0.7034681 0.3748476
## 0.3 1 0.8 1.00 150 0.7120201 0.3883081
## 0.3 2 0.6 0.50 50 0.6860122 0.3469735
## 0.3 2 0.6 0.50 100 0.6897958 0.3476942
## 0.3 2 0.6 0.50 150 0.6860405 0.3380399
## 0.3 2 0.6 0.75 50 0.6972873 0.3615264
## 0.3 2 0.6 0.75 100 0.6994387 0.3624071
## 0.3 2 0.6 0.75 150 0.7014908 0.3687918
## 0.3 2 0.6 1.00 50 0.6983691 0.3650248
## 0.3 2 0.6 1.00 100 0.7021097 0.3672225
## 0.3 2 0.6 1.00 150 0.7026747 0.3657473
## 0.3 2 0.8 0.50 50 0.7053455 0.3783005
## 0.3 2 0.8 0.50 100 0.7066934 0.3762670
## 0.3 2 0.8 0.50 150 0.6952503 0.3555109
## 0.3 2 0.8 0.75 50 0.6999555 0.3670906
## 0.3 2 0.8 0.75 100 0.7055447 0.3728668
## 0.3 2 0.8 0.75 150 0.7020037 0.3671155
## 0.3 2 0.8 1.00 50 0.6975971 0.3673505
## 0.3 2 0.8 1.00 100 0.7020708 0.3749196
## 0.3 2 0.8 1.00 150 0.7023695 0.3696903
## 0.3 3 0.6 0.50 50 0.6858627 0.3369000
## 0.3 3 0.6 0.50 100 0.6827122 0.3316641
## 0.3 3 0.6 0.50 150 0.6813246 0.3282858
## 0.3 3 0.6 0.75 50 0.6946494 0.3580541
## 0.3 3 0.6 0.75 100 0.6918126 0.3483333
## 0.3 3 0.6 0.75 150 0.6827554 0.3265486
## 0.3 3 0.6 1.00 50 0.6921145 0.3542610
## 0.3 3 0.6 1.00 100 0.6878119 0.3420561
## 0.3 3 0.6 1.00 150 0.6872252 0.3386723
## 0.3 3 0.8 0.50 50 0.6855642 0.3367933
## 0.3 3 0.8 0.50 100 0.6833990 0.3318456
## 0.3 3 0.8 0.50 150 0.6822781 0.3289131
## 0.3 3 0.8 0.75 50 0.7023904 0.3765114
## 0.3 3 0.8 0.75 100 0.6975789 0.3617480
## 0.3 3 0.8 0.75 150 0.6970494 0.3584640
## 0.3 3 0.8 1.00 50 0.6905000 0.3486176
## 0.3 3 0.8 1.00 100 0.6843731 0.3336077
## 0.3 3 0.8 1.00 150 0.6830252 0.3316754
## 0.4 1 0.6 0.50 50 0.7013589 0.3753652
## 0.4 1 0.6 0.50 100 0.7050826 0.3762892
## 0.4 1 0.6 0.50 150 0.7077956 0.3790251
## 0.4 1 0.6 0.75 50 0.7098295 0.3885280
## 0.4 1 0.6 0.75 100 0.7176034 0.4011933
## 0.4 1 0.6 0.75 150 0.7112060 0.3866502
## 0.4 1 0.6 1.00 50 0.7010636 0.3755244
## 0.4 1 0.6 1.00 100 0.7072049 0.3823603
## 0.4 1 0.6 1.00 150 0.7087556 0.3825659
## 0.4 1 0.8 0.50 50 0.7012810 0.3717004
## 0.4 1 0.8 0.50 100 0.7071444 0.3772126
## 0.4 1 0.8 0.50 150 0.7093061 0.3793106
## 0.4 1 0.8 0.75 50 0.6959931 0.3608162
## 0.4 1 0.8 0.75 100 0.7074396 0.3736301
## 0.4 1 0.8 0.75 150 0.7071586 0.3749501
## 0.4 1 0.8 1.00 50 0.7015932 0.3757983
## 0.4 1 0.8 1.00 100 0.7090826 0.3856597
## 0.4 1 0.8 1.00 150 0.7138514 0.3922263
## 0.4 2 0.6 0.50 50 0.6933618 0.3484379
## 0.4 2 0.6 0.50 100 0.6861285 0.3390571
## 0.4 2 0.6 0.50 150 0.6970958 0.3583372
## 0.4 2 0.6 0.75 50 0.7036958 0.3740340
## 0.4 2 0.6 0.75 100 0.7098623 0.3868494
## 0.4 2 0.6 0.75 150 0.7077277 0.3820773
## 0.4 2 0.6 1.00 50 0.6956986 0.3609303
## 0.4 2 0.6 1.00 100 0.6983264 0.3611671
## 0.4 2 0.6 1.00 150 0.6986006 0.3601811
## 0.4 2 0.8 0.50 50 0.6940845 0.3465297
## 0.4 2 0.8 0.50 100 0.6864174 0.3347040
## 0.4 2 0.8 0.50 150 0.6898281 0.3386950
## 0.4 2 0.8 0.75 50 0.7018429 0.3691133
## 0.4 2 0.8 0.75 100 0.7071163 0.3801981
## 0.4 2 0.8 0.75 150 0.6991079 0.3635405
## 0.4 2 0.8 1.00 50 0.6983864 0.3634779
## 0.4 2 0.8 1.00 100 0.7013200 0.3703492
## 0.4 2 0.8 1.00 150 0.6983653 0.3626481
## 0.4 3 0.6 0.50 50 0.6770544 0.3220538
## 0.4 3 0.6 0.50 100 0.6766991 0.3228922
## 0.4 3 0.6 0.50 150 0.6766571 0.3192328
## 0.4 3 0.6 0.75 50 0.6893471 0.3460660
## 0.4 3 0.6 0.75 100 0.6959780 0.3568683
## 0.4 3 0.6 0.75 150 0.6938557 0.3516515
## 0.4 3 0.6 1.00 50 0.7043037 0.3790669
## 0.4 3 0.6 1.00 100 0.6983864 0.3659641
## 0.4 3 0.6 1.00 150 0.6951825 0.3571023
## 0.4 3 0.8 0.50 50 0.6816172 0.3291605
## 0.4 3 0.8 0.50 100 0.6833557 0.3306590
## 0.4 3 0.8 0.50 150 0.6721472 0.3044708
## 0.4 3 0.8 0.75 50 0.6895922 0.3440636
## 0.4 3 0.8 0.75 100 0.6788927 0.3243204
## 0.4 3 0.8 0.75 150 0.6828611 0.3261296
## 0.4 3 0.8 1.00 50 0.7002067 0.3621206
## 0.4 3 0.8 1.00 100 0.6957688 0.3551191
## 0.4 3 0.8 1.00 150 0.6938985 0.3515336
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 1, eta
## = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 0.75.
xgb.pred.b <- predict(mod_xgb_fitbalance, newdata = TestData)
We can see that it is indeed a bit more balanced; however, the majority of predicted values is still positive.
The sensitivity is almost 67%, the specificity is at 70% and the accuracy is around 69%. The number of false positives is still 25.
In this section we will compare the main performance metrics of the six analyzed model families (counting the four discriminant analyses as one), each in its unbalanced and balanced version.
| Model | Sensitivity | Specificity | Accuracy |
|---|---|---|---|
| logistic | 0.5200000 | 0.8965517 | 0.7831325 |
| logistic_balance | 0.7200000 | 0.6436782 | 0.6666667 |
| decision_tree | 0.3066667 | 0.9252874 | 0.7389558 |
| decision_tree_balance | 0.7066667 | 0.5919540 | 0.6265060 |
| lda | 0.5200000 | 0.9022989 | 0.7871486 |
| lda_balance | 0.7333333 | 0.6954023 | 0.7068273 |
| qda | 0.5466667 | 0.8390805 | 0.7510040 |
| qda_balance | 0.6533333 | 0.7183908 | 0.6987952 |
| fda | 0.5466667 | 0.8505747 | 0.7590361 |
| fda_balance | 0.7866667 | 0.6609195 | 0.6987952 |
| mda | 0.5466667 | 0.8505747 | 0.7590361 |
| mda_balance | 0.7200000 | 0.7413793 | 0.7349398 |
| rf | 0.4666667 | 0.8735632 | 0.7510040 |
| rf_balance | 0.7866667 | 0.6551724 | 0.6947791 |
| nn | 0.5333333 | 0.8793103 | 0.7751004 |
| nn_balance | 0.7733333 | 0.6149425 | 0.6626506 |
| xgb | 0.5333333 | 0.8735632 | 0.7710843 |
| xgb_balance | 0.6666667 | 0.7011494 | 0.6907631 |
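A table like this can be assembled programmatically from the stored prediction vectors; a minimal sketch for three of the models (get_metrics is our illustrative helper):
get_metrics <- function(pred) {
cm <- caret::confusionMatrix(as.factor(pred), as.factor(TestData$RESPONSE))
c(cm$byClass["Sensitivity"], cm$byClass["Specificity"], cm$overall["Accuracy"])
}
rbind(lda = get_metrics(lda.pred),
lda_balance = get_metrics(lda.pred.b),
rf_balance = get_metrics(rf.pred.b))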
We can see that in terms of sensitivity the best models are the balanced FDA, random forest and neural network. In terms of specificity the best are the unbalanced decision tree, LDA and logistic regression; we expected the unbalanced versions to perform better here, since their predictions contain a majority of positive values, and predicting more positives mechanically increases the specificity. In terms of accuracy the best are the unbalanced LDA, logistic regression and neural network, for the same reason. We were also surprised by the comparison between XGBoost and the random forest: we expected XGBoost to clearly outperform it, but while its unbalanced version is indeed slightly better, its balanced version, the one most relevant for our goal, has a markedly lower sensitivity than the balanced random forest.
More details on the evaluation of the models in the next chapter.
In this chapter we will assess the degree to which the chosen model meets the business objectives, and we will try to determine whether there is some business reason why this model could be deficient.
The process will be to compare the results with the evaluation criteria we determined in chapter 3.
The business goal of this analysis was to determine whether a client was at risk of not being able to pay back the credit that has been granted to them, as it would mean a loss for the company and the shareholders.
We will determine it by considering that the company will grant a credit only to those who have a good credit score, i.e. those for whom the response variable is positive, and will not grant it to those with a response of zero.
In order to do so, we will look at the number of false positives generated by each model, as these are the people to whom a credit would be granted but who would not be able to pay the company back. We will then estimate the potential losses the firm could incur with each specific model, which should stay below 10% of the total amount of credit the company would be willing to grant.
The ones we will look at specifically are the balanced versions of the neural network, the random forest and the xgboost, as they were the ones with the best performance across all the metrics we are considering (specificity, sensitivity and accuracy); the others had good values on one metric but performed poorly on the others.
# entry [2,1] of each confusion matrix counts the false positives:
# predicted 1 (credit granted) while the true response is 0
RF <- confusionMatrix(as.factor(rf.pred.b), as.factor(TestData$RESPONSE))$table[2,1]
NN <- confusionMatrix(as.factor(nn.pred.b), as.factor(TestData$RESPONSE))$table[2,1]
XGB <- confusionMatrix(as.factor(xgb.pred.b), as.factor(TestData$RESPONSE))$table[2,1]
FP <- data.frame(t(data.frame(RF, NN, XGB))) # one row per model
names(FP) <- c("False Positive")
FP
## False Positive
## RF 16
## NN 17
## XGB 25
The table shows the number of false positive instances in the predictions given by each model. As we can see, the lowest value belongs to the random forest and is equal to 16. This means that in at least 16 cases the model would falsely place a person in the category that should be granted a credit when they should not be. These cases are risky for the company, as they could result in a default on the credit and hence in a loss for the company.
However, the models are still quite satisfying, as the false positives are only a low percentage of the number of observations tested; you can find the values in the following table.
FP %<>% dplyr::mutate(Model = c("RF", "NN", "XGB"),
FP_Perc = (FP[,1]/nrow(TestData))) %>% dplyr::select("Model", everything())
FP
## Model False Positive FP_Perc
## 1 RF 16 0.06425703
## 2 NN 17 0.06827309
## 3 XGB 25 0.10040161
We can see that the random forest and the neural network have a false positive percentage below 10%, while the xgboost sits right at that threshold. However, the test set is quite small, hence we should repeat the testing with more data to make sure that the values stay this low.
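To get a feeling for how much these percentages could move with such a small test set, an approximate confidence interval for the false positive rate can be computed; a minimal sketch for the random forest:
prop.test(16, nrow(TestData)) # approximate 95% confidence interval around the 6.4% false positive rate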
We can calculate the maximum losses that would occur if none of the people in the false positive group actually paid back the credit they were granted.
amount <- data_sel[-val_index,]$AMOUNT #get the amount from the unscaled data corresponding to the test set
fp.rf <- (ifelse(rf.pred.b == 1 & TestData$RESPONSE == 0, 1, 0)) #selecting the false positive observations
losses.rf <- sum(fp.rf * amount) #calculating the losses
fp.nn <- (ifelse(nn.pred.b == 1 & TestData$RESPONSE == 0, 1, 0)) #selecting the false positive observations
losses.nn <- sum(fp.nn * amount) #calculating the losses
fp.xgb <- (ifelse(xgb.pred.b == 1 & TestData$RESPONSE == 0, 1, 0)) #selecting the false positive observations
losses.xgb <- sum(fp.xgb * amount) #calculating the losses
Losses <- data.frame(losses.rf, losses.nn, losses.xgb) #create a df
Losses <- data.frame(t(Losses)) #transpose df
names(Losses) <- "Losses" #naming the cols of the df
Losses %<>% dplyr::mutate(Model = c("RF", "NN", "XGB")) %>% dplyr::select(Model, Losses)
Losses
## Model Losses
## 1 RF 47380
## 2 NN 56671
## 3 XGB 99591
As we can see, the amounts range from 47'380 (random forest) to 99'591 (xgboost). The random forest performs best both in terms of false positives (its percentage is lower than those of the xgboost and the neural network) and in terms of losses, where it has the lowest value. This suggests that it puts a higher importance on the amount variable when predicting the category of a new person, and hence keeps the potential losses low. It should therefore be preferred.
We want to determine whether these losses represent a high percentage of the total amount of credit that would be granted to the people belonging to the test set.
sel <- data_sel[-val_index,] #getting the observations unscaled
pos <- sel %>% dplyr::filter(RESPONSE == 1) %>% dplyr::select(AMOUNT) #selecting only the amount of the credits that are granted
Losses %<>% dplyr::mutate(Losses_Perc = Losses / sum(pos))
Losses
## Model Losses Losses_Perc
## 1 RF 47380 0.08826724
## 2 NN 56671 0.10557604
## 3 XGB 99591 0.18553446
As we can see, the model with the lowest percentage is the random forest, and it meets our criterion for the selection of the model, i.e. having losses lower than 10% of the total amount of the credits that would be granted.
However, the losses given by the neural network exceed the threshold by less than 1%, hence it could be argued that this model could also be used if it meant a lower cost for the company in terms of complexity and computation time. This applies only to the neural network and not to the xgboost, which not only performed worse but also took quite some time to fit. What is more, the random forest allows a higher degree of interpretation, while the neural network is more of a black box.
cbind(FP, Losses[,-1])
## Model False Positive FP_Perc Losses Losses_Perc
## 1 RF 16 0.06425703 47380 0.08826724
## 2 NN 17 0.06827309 56671 0.10557604
## 3 XGB 25 0.10040161 99591 0.18553446
We would hence suggest using a random forest model, as it has among the highest sensitivities, the lowest number of false positive predictions and the lowest percentage of losses (about 8.8%), while also offering a higher degree of interpretability and lower complexity compared to the other methods selected at the end of our modelling chapter.
Moreover, we have seen that not all the variables included in the dataset are actually useful for predicting the response. This means that the company, when evaluating a new customer, should rather focus on collecting the information for the selected variables, namely CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE. This would mean lower costs for the company, as less time would be spent gathering useless information and less space would be needed to store it.
We started our data mining with an exploratory data analysis. We looked at the structure of the dataset, which had 32 variables and 1'000 observations. Then we had a more detailed look at the output variable and concluded that we had a binary response with a majority of positive instances. Looking at the independent variables, we saw that the continuous ones were skewed and had different scales. We also identified some errors in the data, which were fixed, while no missing values were found. In the second part of our EDA we built a few categorical variables in order to reduce the number of variables needed in the modelling: more specifically, a binary variable describing the sex of the person, one categorical variable for the purpose of the credit, one for the property and another one for the residence. To assess whether it made sense to aggregate the variables, chi-squared tests were run. To further select the data, we created a simple linear regression and used the AIC to keep only the most significant variables. We were then able to move on to the modelling part, in which we used 6 different models, namely: logistic regression, decision trees, discriminant analysis, random forest, neural network and xgboost. For each of them we fit a model on the unbalanced training set (containing 75% of the data, randomly selected) and compared its predictions to the test set (containing the remaining 25% of the data). We also balanced the dataset, in order to have around the same number of positive and negative values for the response, and fit the same models on a training set built in the same way from this data, again comparing the predictions to the test set. For each model we built a confusion matrix and considered the accuracy, specificity and sensitivity, with a higher weight put on the latter. This allowed us to select the models for the evaluation part, in which we considered the number of false positives in the predictions and the losses that would be associated with them. The result was the selection of the random forest model, which outperformed all the other ones.
We believe that what has been done was an accurate analysis of the data; however, some improvements could be made in terms of process performance. More specifically, the variable created to describe the sex of the person was not selected, hence it was not necessary to create it. Moreover, the correlations were calculated but not really used for the selection of the variables, so they could have been avoided too. We also believe that the coding could have been executed more efficiently, as a lot of repetition occurred, specifically in the modelling part. We could have created a function for the modelling and used it to reduce the lines of code, or found another way to optimize it, e.g. a different library. Thanks to the caret package, however, we were already able to streamline a good part of the code, which would otherwise have been even longer and more complicated. What is more, we could have included different models, as some of the ones we used were elementary and were expected to perform poorly compared to more complex ones such as the neural network or the random forest. We could have kept one simple model as a baseline, to see whether the increase in accuracy, sensitivity and specificity was high enough to justify the increase in complexity, and then kept only the best-performing ones and tried some others.
In any case, the results we have found are quite satisfying, as we could still find two models whose predictions meet (or nearly meet) our business success criteria.
To improve the process, another model could be selected, perhaps one that was not considered in our analysis. However, we believe the results obtained are already satisfying.
Another way to improve the model could be to consider information that was not available in our analysis, such as the number of other pending credits or the history of (un)repaid credits.
An alternative could be to gather information from other credit companies, banks, insurances, etc., so that a more powerful model could be fitted.
With our analysis, the company should be able to assess the quality of a new customer and predict whether it would be a good idea to grant them a credit.
We believe that the company should follow these steps each time a new customer approaches the firm from now on:
1. Collect the information only for the selected variables, namely CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE.
2. With the information gathered, run a random forest prediction and determine whether the credit should be granted (see the sketch below).
3. Store the result of the decision.
4. In case the credit was given, wait and see whether it is paid back.
5. Store the result of the debt settlement.
6. Use the new data to fit an upgraded model.
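Step 2 would amount to a single call to the fitted model; a minimal sketch, where new_applicant is a hypothetical one-row data frame containing the selected variables, prepared with the same transformations as the training data:
predict(mod_rf_fitbalance, newdata = new_applicant) # hypothetical new_applicant; returns 1 (grant) or 0 (reject)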
Goal: Losses < 10% amount of credit
According to NASA Technical Reports, human error has been reported as being responsible for 60%-80% of failures, which means that automating the selection process would reduce the error and the risk of loans not being repaid. However, the application of this tool must coexist with experienced staff, because some factors must still be taken into account, for example the verification of the documents provided with the application. In addition, there would be an improvement in response times and a decrease in the workload of the staff.
At the beginning of the project, we asked ourselves some questions to help us meet our main objective; we now use them to frame our conclusions.
Are there any variables that could be grouped?
Yes, we tested the independence of some variables and grouped them. In doing so we created some dummy variables; for instance, the variable for the purpose of the credit has 6 different levels.
Have we used all the original independent variables of the model?
No, in the end we selected the 15 variables that bring the most information to the model: CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE.
Is the data balanced with respect to the response variable?
No, the data is not balanced. At the beginning of the modelling we detected a strong inclination towards predicting positives, which means the models were biased. To correct this, we changed the training parameters of each model in order to balance the data and maximize the sensitivity.
Does it make sense to balance the data to avoid the model being biased?
Yes, the solutions obtained with balanced data generally showed a much higher sensitivity, at the cost of some specificity and accuracy.
Accuracy, sensitivity or specificity: which one should we focus on most?
In our case, we focused on maximizing the sensitivity, since correctly identifying the bad credit risks is key to avoiding false positives; a false positive means granting a credit to a client who cannot afford the repayments, in which case the bank could not collect the interest.
Which model fits best?
As mentioned in the evaluation section, we decided to go for the random forest model. It is the model that best manages the trade-off between accuracy and sensitivity, with the lowest percentage and value of losses.
*Classification and Regression Training: caret R documentation. https://www.rdocumentation.org/packages/caret/versions/6.0-86*
*Xie, Y., Allaire, J. J., & Grolemund, G. (2020, April 26). R Markdown: The Definitive Guide. Retrieved from https://bookdown.org/yihui/rmarkdown/*
*R Markdown: Code Chunks lesson. Retrieved from https://rmarkdown.rstudio.com/lesson-7.html*